The purpose of this tutorial is to demonstrate that data collection and analysis can improve outcomes and significantly increase the welfare of animals in shelters. Basic data collection in shelters is common but rarely used to its full potential. This tutorial takes shelter data from the Austin Animal Center and uses simple information collected from each animal to generate new insights that can inform shelter policies. The insights gained include which animals may be avoided by adopters, geographic information on where animals are found, and which factors may matter most for an animal being adopted into a loving home: the ultimate goal of any animal shelter.
While many shelters do keep some data on the animals that pass through them for reporting purposes, they rarely use this data analytically. This may be due to poor data quality and inconsistent record keeping, or to a lack of software and expertise. Perhaps the most common reasons are inertia and resources: many shelters spend their scant money and time on animal care, leaving little for data science, which is often an expensive and time-consuming prospect.
Although many shelters would not consider data science a priority, such analysis can reduce costs and increase animal welfare by redirecting resources to where they are needed. Analysis of animal data can also give a shelter metrics of progress; even a single number can help the shelter understand where its strong and weak points lie, whether in recruiting fosters or in maintaining the living conditions certain types of animals need. Finally, data gives a clear empirical account of what is happening within a non-profit shelter. Such information can be an important tool for motivating executive officers to action, and it provides transparency, which is vital in the NPO sector.
The data comes from the Austin Animal Center's shelter. It consists of two separate tables: one with each animal's initial condition upon being received by the shelter and one with each animal's outcome. The data is live and updated daily.
# #download the data as csv files in the current directory
# #since these files are large and we don't want to download data all the time,
# #this code was last run on (12/2)
# import requests
# #give a big limit parameter to download all the shelter data
# intake_url = 'https://data.austintexas.gov/resource/wter-evkm.csv?$limit=99999999999999999'
# outcomes_url = 'https://data.austintexas.gov/resource/9t4d-g238.csv?$limit=99999999999999999'
# response = requests.get(intake_url)
# with open("intakes.csv", 'wb') as f:
#     f.write(response.content)
# response = requests.get(outcomes_url)
# with open("outcomes.csv", 'wb') as f:
#     f.write(response.content)
import pandas as pd
import numpy as np
import csv
from datetime import datetime
import urllib
import requests
#read downloaded csvs
intakes = pd.read_csv("intakes.csv")
outcomes = pd.read_csv("outcomes.csv")
intakes.head()
| animal_id | name | datetime | datetime2 | found_location | intake_type | intake_condition | animal_type | sex_upon_intake | age_upon_intake | breed | color | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A869978 | NaN | 2022-11-29T07:04:00.000 | 2022-11-29T07:04:00.000 | 14236 Fm 969 in Travis (TX) | Stray | Sick | Dog | Intact Male | 1 month | Cane Corso | Black |
| 1 | A869965 | NaN | 2022-11-28T19:00:00.000 | 2022-11-28T19:00:00.000 | 6704 Branrust Dr in Austin (TX) | Stray | Injured | Cat | Unknown | 2 years | Domestic Shorthair | Orange Tabby/White |
| 2 | A869969 | Biji | 2022-11-28T18:56:00.000 | 2022-11-28T18:56:00.000 | Austin (TX) | Owner Surrender | Normal | Cat | Neutered Male | 1 year | Domestic Longhair | Blue/White |
| 3 | A869967 | NaN | 2022-11-28T18:46:00.000 | 2022-11-28T18:46:00.000 | 1309 Walnut Avenue in Travis (TX) | Stray | Normal | Cat | Intact Female | 1 year | Domestic Shorthair | Brown Tabby/White |
| 4 | A869968 | NaN | 2022-11-28T18:32:00.000 | 2022-11-28T18:32:00.000 | 11210 Ranch To Market 222 in Austin (TX) | Owner Surrender | Normal | Dog | Unknown | 2 years | Miniature Schnauzer | Black |
Most of the columns are self-explanatory. Note that we have two datetime columns that appear identical.
outcomes.head()
| animal_id | name | datetime | monthyear | date_of_birth | outcome_type | outcome_subtype | animal_type | sex_upon_outcome | age_upon_outcome | breed | color | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A869752 | NaN | 2022-11-29T05:00:00.000 | 2022-11-29T05:00:00.000 | 2012-11-23T00:00:00.000 | Died | At Vet | Cat | Intact Male | 10 years | Domestic Shorthair | Orange Tabby/White |
| 1 | A869596 | NaN | 2022-11-29T00:00:00.000 | 2022-11-29T00:00:00.000 | 2019-11-21T00:00:00.000 | Adoption | NaN | Dog | Intact Male | 3 years | Schipperke/Cardigan Welsh Corgi | Black |
| 2 | A861295 | *Gordon | 2022-11-28T19:12:00.000 | 2022-11-28T19:12:00.000 | 2022-05-05T00:00:00.000 | Adoption | NaN | Dog | Neutered Male | 6 months | Labrador Retriever/Pointer | Black/White |
| 3 | A869713 | Princess | 2022-11-28T18:57:00.000 | 2022-11-28T18:57:00.000 | 2017-11-28T00:00:00.000 | Return to Owner | NaN | Dog | Intact Female | 5 years | Pit Bull | Blue/White |
| 4 | A864284 | *Sparkle | 2022-11-28T18:45:00.000 | 2022-11-28T18:45:00.000 | 2020-03-23T00:00:00.000 | Adoption | NaN | Dog | Spayed Female | 2 years | Pit Bull | Brown Brindle |
We will explore outcome types in detail later. First, note that outcomes also carries a second timestamp-like column, monthyear. Let's check whether the duplicated-looking columns in each table are truly identical.
print(intakes['datetime'].equals(intakes['datetime2']))
print((outcomes['datetime'].equals(outcomes['monthyear'])))
True
True
The duplicated datetime columns in both datasets are exactly the same. Let's remove one from each.
intakes.drop('datetime2', inplace= True, axis = 1)
outcomes.drop('monthyear', inplace= True, axis = 1)
#number of rows in each
print(len(intakes))
print(len(outcomes))
146162
146240
Oddly, there are slightly more outcomes than intakes. Shouldn't it be the other way around?
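One quick way to probe the discrepancy is to check for animal ids that appear in outcomes but never in intakes; such animals were never properly logged on the way in. A minimal sketch on hypothetical ids (not the real tables):

```python
import pandas as pd

# Hypothetical ids standing in for the real tables
intake_ids = pd.DataFrame({"animal_id": ["A1", "A2"]})
outcome_ids = pd.DataFrame({"animal_id": ["A2", "A3"]})

# Ids with an outcome but no recorded intake
only_in_outcomes = set(outcome_ids["animal_id"]) - set(intake_ids["animal_id"])
```

Another possible culprit is duplicated rows, which we check next.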
#count fully duplicated rows in each
print(intakes.duplicated().sum())
print(outcomes.duplicated().sum())
23
19
intakes[intakes['animal_id'] == 'A857222']
| animal_id | name | datetime | found_location | intake_type | intake_condition | animal_type | sex_upon_intake | age_upon_intake | breed | color | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 6823 | A857222 | Keanu | 2022-05-13T13:58:00.000 | 1800 S Ih 35 in Austin (TX) | Owner Surrender | Normal | Other | Intact Male | 8 months | Guinea Pig | White/Gray |
| 6824 | A857222 | Keanu | 2022-05-13T13:58:00.000 | 1800 S Ih 35 in Austin (TX) | Owner Surrender | Normal | Other | Intact Male | 8 months | Guinea Pig | White/Gray |
It also appears that some observations are entirely duplicated. Let's fix that.
#drop duplicates in both datasets
intakes.drop_duplicates(inplace = True)
outcomes.drop_duplicates(inplace = True)
#get the new length of the intakes data
len(intakes)
146139
#look at any duplicated animal ids
intakes[intakes['animal_id'].duplicated()].head(2)
| animal_id | name | datetime | found_location | intake_type | intake_condition | animal_type | sex_upon_intake | age_upon_intake | breed | color | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 256 | A869377 | Henry | 2022-11-17T13:35:00.000 | 810 E Slaughter Ln in Austin (TX) | Owner Surrender | Normal | Dog | Intact Male | 8 months | Dachshund/Doberman Pinsch | Tricolor |
| 518 | A845526 | Jackson | 2022-11-08T18:05:00.000 | Thunder Gulch Drive in Austin (TX) | Stray | Normal | Dog | Intact Male | 1 year | Bull Terrier | White |
#look at Henry's rows in the data
intakes[intakes['animal_id'] == 'A869377']
| animal_id | name | datetime | found_location | intake_type | intake_condition | animal_type | sex_upon_intake | age_upon_intake | breed | color | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 10 | A869377 | Henry | 2022-11-28T17:17:00.000 | Travis (TX) | Owner Surrender | Normal | Dog | Neutered Male | 8 months | Dachshund/Doberman Pinsch | Tricolor |
| 256 | A869377 | Henry | 2022-11-17T13:35:00.000 | 810 E Slaughter Ln in Austin (TX) | Owner Surrender | Normal | Dog | Intact Male | 8 months | Dachshund/Doberman Pinsch | Tricolor |
While we have taken out duplicated observations, there are duplicated ids. This is because some animals have passed through the shelter multiple times.
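Repeat stays can be quantified with value_counts; here is a minimal sketch on toy ids (hypothetical, not the real data):

```python
import pandas as pd

# Toy ids: A1 passed through the shelter three times
ids = pd.Series(["A1", "A2", "A1", "A3", "A1"])
visits = ids.value_counts()

# Number of animals with more than one recorded stay
repeat_visitors = (visits > 1).sum()
```

On the real tables, the same pattern counts how many animals are shelter regulars.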
#number of rows in each
print(len(intakes))
print(len(outcomes))
146139
146221
Ideally, we want one observation for every time an animal has stayed at and left the shelter. This means we must eliminate observations for animals still in the shelter (an intake with no outcome) and observations for animals that were not properly tagged upon intake (an outcome with no matching intake). To achieve this, we first need to do more cleanup on our data.
Let's convert our datetime strings into something more usable. This will be necessary to compare and join our observations across the two tables.
intakes['datetime'] = pd.to_datetime(intakes['datetime'])
outcomes['datetime'] = pd.to_datetime(outcomes['datetime'])
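After conversion, the values support real datetime operations, and the .dt accessor exposes components such as the year, which we will use for the time series plots later. A small sketch:

```python
import pandas as pd

# A raw timestamp string in the same format the shelter uses
raw = pd.Series(["2022-11-29T07:04:00.000"])
converted = pd.to_datetime(raw)

# Converted values can be compared and decomposed
later_than_2021 = converted.iloc[0] > pd.Timestamp("2021-12-31")
year = converted.dt.year.iloc[0]
```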
We will also need to sort these dates for the merge of our two datasets.
# sort the data columns
outcomes.sort_values('datetime',inplace=True)
intakes.sort_values('datetime',inplace=True)
#add another column to preserve the outgoing date
outcomes['datetime2'] = outcomes['datetime']
We'll do an exact merge on the animal_id column and a closest-match merge on datetime. This will correctly pair animals that have gone through the shelter multiple times. The direction='forward' argument is important: it matches each intake with the closest outcome at or after it.
#merge our dataframes with an exact on id and partial on date
intakes_and_outcomes = pd.merge_asof( left = intakes, right = outcomes, on = ['datetime'], by = ['animal_id'], direction = 'forward')
#look at layout of the resulting dataframe
intakes_and_outcomes.head()
| animal_id | name_x | datetime | found_location | intake_type | intake_condition | animal_type_x | sex_upon_intake | age_upon_intake | breed_x | ... | name_y | date_of_birth | outcome_type | outcome_subtype | animal_type_y | sex_upon_outcome | age_upon_outcome | breed_y | color_y | datetime2 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A521520 | Nina | 2013-10-01 07:51:00 | Norht Ec in Austin (TX) | Stray | Normal | Dog | Spayed Female | 7 years | Border Terrier/Border Collie | ... | Nina | 2006-09-07T00:00:00.000 | Return to Owner | NaN | Dog | Spayed Female | 7 years | Border Terrier/Border Collie | White/Tan | 2013-10-01 15:39:00 |
| 1 | A664237 | NaN | 2013-10-01 08:33:00 | Abia in Austin (TX) | Stray | Normal | Cat | Unknown | 1 week | Domestic Shorthair Mix | ... | NaN | 2013-09-24T00:00:00.000 | Transfer | Partner | Cat | Unknown | 1 week | Domestic Shorthair Mix | Orange/White | 2013-10-01 10:44:00 |
| 2 | A664236 | NaN | 2013-10-01 08:33:00 | Abia in Austin (TX) | Stray | Normal | Cat | Unknown | 1 week | Domestic Shorthair Mix | ... | NaN | 2013-09-24T00:00:00.000 | Transfer | Partner | Cat | Unknown | 1 week | Domestic Shorthair Mix | Orange/White | 2013-10-01 10:44:00 |
| 3 | A664235 | NaN | 2013-10-01 08:33:00 | Abia in Austin (TX) | Stray | Normal | Cat | Unknown | 1 week | Domestic Shorthair Mix | ... | NaN | 2013-09-24T00:00:00.000 | Transfer | Partner | Cat | Unknown | 1 week | Domestic Shorthair Mix | Orange/White | 2013-10-01 10:39:00 |
| 4 | A664233 | Stevie | 2013-10-01 08:53:00 | 7405 Springtime in Austin (TX) | Stray | Injured | Dog | Intact Female | 3 years | Pit Bull Mix | ... | Stevie | 2010-09-30T00:00:00.000 | Euthanasia | Suffering | Dog | Intact Female | 3 years | Pit Bull Mix | Blue/White | 2013-10-01 15:33:00 |
5 rows × 21 columns
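To make the matching rule concrete, here is a minimal merge_asof sketch on a hypothetical animal with two stays (toy ids and dates, not drawn from the real tables):

```python
import pandas as pd

# Two stays for the same animal; both frames sorted by datetime, as merge_asof requires
intakes_toy = pd.DataFrame({
    "animal_id": ["A1", "A1"],
    "datetime": pd.to_datetime(["2022-01-01", "2022-03-01"]),
})
outcomes_toy = pd.DataFrame({
    "animal_id": ["A1", "A1"],
    "datetime": pd.to_datetime(["2022-01-10", "2022-03-05"]),
    "outcome_type": ["Return to Owner", "Adoption"],
})

# direction='forward' pairs each intake with the nearest outcome at or after it
matched = pd.merge_asof(intakes_toy, outcomes_toy,
                        on="datetime", by="animal_id", direction="forward")
```

Each intake picks up the outcome that ends its own stay rather than a later one.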
Our merge should match the appropriate outcome (if there is one) with the corresponding intake. Let's see if it worked by comparing Champ’s intakes and outcomes below with our resulting dataset.
#intakes for Champ
intakes[intakes['animal_id'] == 'A641748']
| animal_id | name | datetime | found_location | intake_type | intake_condition | animal_type | sex_upon_intake | age_upon_intake | breed | color | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 145526 | A641748 | Champ | 2013-10-13 14:06:00 | E Cesar Chavez St/Ih-35 in Austin (TX) | Stray | Normal | Dog | Neutered Male | 4 years | Doberman Pinsch Mix | Black/Tan |
| 145430 | A641748 | Champ | 2013-10-15 11:19:00 | E Cesar Chavez/Ih-35 in Austin (TX) | Stray | Normal | Dog | Neutered Male | 4 years | Doberman Pinsch Mix | Black/Tan |
| 137120 | A641748 | Champ | 2014-04-19 11:37:00 | Cesar Chavez & Chicon in Austin (TX) | Stray | Normal | Dog | Neutered Male | 5 years | Doberman Pinsch Mix | Black/Tan |
#outcomes for Champ
outcomes[outcomes['animal_id'] == 'A641748']
| animal_id | name | datetime | date_of_birth | outcome_type | outcome_subtype | animal_type | sex_upon_outcome | age_upon_outcome | breed | color | datetime2 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 145577 | A641748 | Champ | 2013-10-13 14:43:00 | 2008-11-12T00:00:00.000 | Return to Owner | NaN | Dog | Neutered Male | 4 years | Doberman Pinsch Mix | Black/Tan | 2013-10-13 14:43:00 |
| 145377 | A641748 | Champ | 2013-10-15 18:43:00 | 2008-11-12T00:00:00.000 | Return to Owner | NaN | Dog | Neutered Male | 4 years | Doberman Pinsch Mix | Black/Tan | 2013-10-15 18:43:00 |
| 136687 | A641748 | Champ | 2014-04-25 16:26:00 | 2008-11-12T00:00:00.000 | Transfer | Partner | Dog | Neutered Male | 5 years | Doberman Pinsch Mix | Black/Tan | 2014-04-25 16:26:00 |
#resulting observations for Champ
intakes_and_outcomes[intakes_and_outcomes['animal_id'] == 'A641748']
| animal_id | name_x | datetime | found_location | intake_type | intake_condition | animal_type_x | sex_upon_intake | age_upon_intake | breed_x | ... | name_y | date_of_birth | outcome_type | outcome_subtype | animal_type_y | sex_upon_outcome | age_upon_outcome | breed_y | color_y | datetime2 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 635 | A641748 | Champ | 2013-10-13 14:06:00 | E Cesar Chavez St/Ih-35 in Austin (TX) | Stray | Normal | Dog | Neutered Male | 4 years | Doberman Pinsch Mix | ... | Champ | 2008-11-12T00:00:00.000 | Return to Owner | NaN | Dog | Neutered Male | 4 years | Doberman Pinsch Mix | Black/Tan | 2013-10-13 14:43:00 |
| 731 | A641748 | Champ | 2013-10-15 11:19:00 | E Cesar Chavez/Ih-35 in Austin (TX) | Stray | Normal | Dog | Neutered Male | 4 years | Doberman Pinsch Mix | ... | Champ | 2008-11-12T00:00:00.000 | Return to Owner | NaN | Dog | Neutered Male | 4 years | Doberman Pinsch Mix | Black/Tan | 2013-10-15 18:43:00 |
| 9041 | A641748 | Champ | 2014-04-19 11:37:00 | Cesar Chavez & Chicon in Austin (TX) | Stray | Normal | Dog | Neutered Male | 5 years | Doberman Pinsch Mix | ... | Champ | 2008-11-12T00:00:00.000 | Transfer | Partner | Dog | Neutered Male | 5 years | Doberman Pinsch Mix | Black/Tan | 2014-04-25 16:26:00 |
3 rows × 21 columns
Looks good! We now have both the intake of an animal and its corresponding outtake paired in the same row.
However, we also have some intake observations that did not match with an outcome. Using dropna() we can remove these observations, which tell us nothing about the result or duration of an animal's stay.
#remove observations that did not match to an outcome
intakes_and_outcomes.dropna(axis = 0, how = 'all', subset = ['name_y', 'date_of_birth',
'outcome_type', 'outcome_subtype', 'animal_type_y', 'sex_upon_outcome',
'age_upon_outcome', 'breed_y', 'color_y', 'datetime2'], inplace=True )
The merge operation renames columns that appear in both tables. We will now drop redundant columns and rename the rest to something more meaningful.
#drop redundant columns
#these are safe to drop because an animal's name, type, breed, and color don't change between intake and outcome
intakes_and_outcomes.drop(['name_y','animal_type_y','breed_y','color_y'], axis = 1, inplace= True)
#rename columns
intakes_and_outcomes.rename(columns={"name_x":"name", "breed_x":"breed", "animal_type_x": "animal_type", "color_x": "color", "datetime2": "release_date", "datetime": "intake_date"}, inplace = True)
#resulting dataframe
intakes_and_outcomes.head()
| animal_id | name | intake_date | found_location | intake_type | intake_condition | animal_type | sex_upon_intake | age_upon_intake | breed | color | date_of_birth | outcome_type | outcome_subtype | sex_upon_outcome | age_upon_outcome | release_date | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A521520 | Nina | 2013-10-01 07:51:00 | Norht Ec in Austin (TX) | Stray | Normal | Dog | Spayed Female | 7 years | Border Terrier/Border Collie | White/Tan | 2006-09-07T00:00:00.000 | Return to Owner | NaN | Spayed Female | 7 years | 2013-10-01 15:39:00 |
| 1 | A664237 | NaN | 2013-10-01 08:33:00 | Abia in Austin (TX) | Stray | Normal | Cat | Unknown | 1 week | Domestic Shorthair Mix | Orange/White | 2013-09-24T00:00:00.000 | Transfer | Partner | Unknown | 1 week | 2013-10-01 10:44:00 |
| 2 | A664236 | NaN | 2013-10-01 08:33:00 | Abia in Austin (TX) | Stray | Normal | Cat | Unknown | 1 week | Domestic Shorthair Mix | Orange/White | 2013-09-24T00:00:00.000 | Transfer | Partner | Unknown | 1 week | 2013-10-01 10:44:00 |
| 3 | A664235 | NaN | 2013-10-01 08:33:00 | Abia in Austin (TX) | Stray | Normal | Cat | Unknown | 1 week | Domestic Shorthair Mix | Orange/White | 2013-09-24T00:00:00.000 | Transfer | Partner | Unknown | 1 week | 2013-10-01 10:39:00 |
| 4 | A664233 | Stevie | 2013-10-01 08:53:00 | 7405 Springtime in Austin (TX) | Stray | Injured | Dog | Intact Female | 3 years | Pit Bull Mix | Blue/White | 2010-09-30T00:00:00.000 | Euthanasia | Suffering | Intact Female | 3 years | 2013-10-01 15:33:00 |
Let's also check for NA observations.
#look for suspicious observations missing age_upon_intake
intakes_and_outcomes[intakes_and_outcomes['age_upon_intake'].isna()]
| animal_id | name | intake_date | found_location | intake_type | intake_condition | animal_type | sex_upon_intake | age_upon_intake | breed | color | date_of_birth | outcome_type | outcome_subtype | sex_upon_outcome | age_upon_outcome | release_date | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 124099 | A830333 | Test | 2021-03-03 18:30:00 | 1234 Test Street in Austin (TX) | Stray | Normal | Dog | NaN | NaN | Kuvasz | Unknown | 2022-10-12T14:51:00.000 | Return to Owner | NaN | NaN | NaN | 2022-10-12 14:51:00 |
#remove fake observation
intakes_and_outcomes = intakes_and_outcomes.drop(124099)
intakes_and_outcomes[intakes_and_outcomes['age_upon_outcome'].isna()].head()
| animal_id | name | intake_date | found_location | intake_type | intake_condition | animal_type | sex_upon_intake | age_upon_intake | breed | color | date_of_birth | outcome_type | outcome_subtype | sex_upon_outcome | age_upon_outcome | release_date | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 137857 | A854611 | NaN | 2022-04-06 15:16:00 | 2500 Matthews Drive in Austin (TX) | Wildlife | Sick | Other | Unknown | 2 years | Raccoon | Gray/Black | 2020-04-06T00:00:00.000 | Euthanasia | Rabies Risk | Unknown | NaN | 2022-04-06 17:16:00 |
| 138149 | A855204 | Bat | 2022-04-14 19:10:00 | Austin (TX) | Wildlife | Normal | Other | Unknown | 1 year | Bat | Brown | 2021-04-14T00:00:00.000 | Euthanasia | Rabies Risk | Unknown | NaN | 2022-04-15 07:43:00 |
| 138574 | A856021 | NaN | 2022-04-27 12:04:00 | 43 Rainey St in Austin (TX) | Wildlife | Injured | Other | Unknown | 2 years | Bat | Brown | 2020-04-27T00:00:00.000 | Euthanasia | Rabies Risk | Unknown | NaN | 2022-04-27 12:37:00 |
| 138652 | A856155 | NaN | 2022-04-29 10:24:00 | 300 West Riverside Dr in Austin (TX) | Wildlife | Normal | Other | Unknown | 2 years | Bat | Black | 2020-04-29T00:00:00.000 | Euthanasia | Rabies Risk | Unknown | NaN | 2022-04-29 14:54:00 |
| 138740 | A856301 | NaN | 2022-05-02 11:11:00 | 500 East Riverside Drive in Austin (TX) | Wildlife | Injured | Other | Unknown | 1 year | Bat | Brown | 2021-05-02T00:00:00.000 | Euthanasia | NaN | Unknown | NaN | 2022-05-02 13:05:00 |
Interestingly, it seems the age of animals that were euthanized (mostly wildlife flagged as rabies risks) was not logged into outcomes by the shelter.
print(len(intakes_and_outcomes[intakes_and_outcomes['outcome_type'].isna()]))
intakes_and_outcomes[intakes_and_outcomes['outcome_type'].isna()].head()
25
| animal_id | name | intake_date | found_location | intake_type | intake_condition | animal_type | sex_upon_intake | age_upon_intake | breed | color | date_of_birth | outcome_type | outcome_subtype | sex_upon_outcome | age_upon_outcome | release_date | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4886 | A671017 | NaN | 2014-01-17 12:46:00 | 10105 Falcon Ridge Dr. in Travis (TX) | Wildlife | Sick | Other | Unknown | 1 year | Fox Mix | Brown/Black | 2013-01-17T00:00:00.000 | NaN | NaN | Unknown | 1 year | 2014-01-20 08:27:00 |
| 7011 | A674209 | NaN | 2014-03-09 16:38:00 | 4210 W. Braker in Austin (TX) | Wildlife | Normal | Other | Unknown | 0 years | Bat Mix | Brown | 2014-03-09T00:00:00.000 | NaN | NaN | Unknown | 0 years | 2014-03-09 18:15:00 |
| 16116 | A686025 | NaN | 2014-08-15 18:48:00 | 12901 N Ih 35 #1100 in Austin (TX) | Wildlife | Normal | Other | Unknown | 1 year | Bat Mix | Brown | 2013-08-15T00:00:00.000 | NaN | NaN | Unknown | 1 year | 2014-08-16 08:35:00 |
| 116626 | A815725 | NaN | 2020-03-24 10:57:00 | 321 West Ben White in Austin (TX) | Wildlife | Normal | Other | Unknown | 1 year | Bat | Black | 2019-03-24T00:00:00.000 | NaN | NaN | Unknown | 1 year | 2020-03-24 13:42:00 |
| 123015 | A812591 | Cookie | 2020-12-29 12:32:00 | 1907 Bitter Creek Drive in Austin (TX) | Stray | Sick | Dog | Intact Female | 1 year | American Bulldog Mix | White/Black | 2019-06-26T00:00:00.000 | NaN | NaN | Spayed Female | 1 year | 2021-01-05 17:25:00 |
There are a few animals without outcome types. Let's remove them from our data as well.
#remove rows from dataframe
intakes_and_outcomes.dropna(subset=['outcome_type'], inplace= True)
intakes_and_outcomes[intakes_and_outcomes['sex_upon_outcome'].isna()]
| animal_id | name | intake_date | found_location | intake_type | intake_condition | animal_type | sex_upon_intake | age_upon_intake | breed | color | date_of_birth | outcome_type | outcome_subtype | sex_upon_outcome | age_upon_outcome | release_date | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2375 | A667395 | Diego | 2013-11-17 13:15:00 | Pflugerville (TX) | Owner Surrender | Normal | Dog | NaN | 7 years | Dachshund | Brown Merle | 2006-11-17T00:00:00.000 | Return to Owner | NaN | NaN | 7 years | 2013-11-27 16:11:00 |
| 144836 | A705094 | Magna | 2022-10-12 13:07:00 | 11614N Ruffed Grouse Drive in Austin (TX) | Owner Surrender | Behavior | Dog | Spayed Female | 8 years | Staffordshire Mix | Blue/White | 2014-06-18T00:00:00.000 | Euthanasia | Rabies Risk | NaN | NaN | 2022-10-12 17:46:00 |
Let's drop Diego and assume that Magna remained a spayed female.
#drop diego
intakes_and_outcomes = intakes_and_outcomes.drop(2375)
#update magna
intakes_and_outcomes.at[144836, 'sex_upon_outcome'] = 'Spayed Female'
We also need to convert the date of birth column to datetime for a later analysis.
intakes_and_outcomes['date_of_birth'] = pd.to_datetime(intakes_and_outcomes['date_of_birth'])
intakes_and_outcomes.drop(['age_upon_intake','age_upon_outcome'], axis = 1, inplace= True)
intakes_and_outcomes.head()
| animal_id | name | intake_date | found_location | intake_type | intake_condition | animal_type | sex_upon_intake | breed | color | date_of_birth | outcome_type | outcome_subtype | sex_upon_outcome | release_date | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A521520 | Nina | 2013-10-01 07:51:00 | Norht Ec in Austin (TX) | Stray | Normal | Dog | Spayed Female | Border Terrier/Border Collie | White/Tan | 2006-09-07 | Return to Owner | NaN | Spayed Female | 2013-10-01 15:39:00 |
| 1 | A664237 | NaN | 2013-10-01 08:33:00 | Abia in Austin (TX) | Stray | Normal | Cat | Unknown | Domestic Shorthair Mix | Orange/White | 2013-09-24 | Transfer | Partner | Unknown | 2013-10-01 10:44:00 |
| 2 | A664236 | NaN | 2013-10-01 08:33:00 | Abia in Austin (TX) | Stray | Normal | Cat | Unknown | Domestic Shorthair Mix | Orange/White | 2013-09-24 | Transfer | Partner | Unknown | 2013-10-01 10:44:00 |
| 3 | A664235 | NaN | 2013-10-01 08:33:00 | Abia in Austin (TX) | Stray | Normal | Cat | Unknown | Domestic Shorthair Mix | Orange/White | 2013-09-24 | Transfer | Partner | Unknown | 2013-10-01 10:39:00 |
| 4 | A664233 | Stevie | 2013-10-01 08:53:00 | 7405 Springtime in Austin (TX) | Stray | Injured | Dog | Intact Female | Pit Bull Mix | Blue/White | 2010-09-30 | Euthanasia | Suffering | Intact Female | 2013-10-01 15:33:00 |
Since we have each animal's date of birth, the 'age_upon_intake' and 'age_upon_outcome' columns were redundant, which is why we dropped them above.
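When an age is needed, it can be recomputed from date_of_birth against any reference date. A sketch using Nina's dates from the first row of the table above:

```python
import pandas as pd

dob = pd.Timestamp("2006-09-07")
intake = pd.Timestamp("2013-10-01 07:51:00")

# Approximate age in years at intake (365.25 accounts for leap years)
age_years = (intake - dob).days / 365.25
```

This matches the "7 years" the shelter recorded for Nina at intake in the earlier tables.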
Now that our dataframe is clean and we have our paired observations, we will perform some visual exploration of the data. Most of the features of the resulting dataset are categorical in nature. We will explore the relative size of these categories using bar plots and provide simple analysis to understand what features may be useful, interesting, or surprising.
import matplotlib as plt
#ensure the pyplot submodule is loaded; `import matplotlib` alone does not import it
import matplotlib.pyplot
import matplotlib.colors as mcolors
# fig, axis = plt.subplots()
ax = intakes_and_outcomes['animal_type'].value_counts().plot(kind = 'bar', color= mcolors.TABLEAU_COLORS, figsize=(5, 5))
ax.set_title("Shelter Animals by Type",fontdict={'fontsize':'30'})
ax.set_facecolor('xkcd:grey')
ax.set_ylabel('Number of Animals', fontsize=20)
ax.set_xlabel('Type of Animal', fontsize=20)
ax.tick_params(axis='x', rotation = 45)
Mostly cats and dogs pass through the shelter, as expected. A smaller number of birds have been hosted by the shelter, and even some livestock. Our future analysis will focus mostly on the dogs and cats, which dominate the dataset and use the most shelter resources.
What about the categories for breed?
intakes_and_outcomes['breed'].value_counts()
Domestic Shorthair Mix 32563
Domestic Shorthair 12791
Pit Bull Mix 9157
Labrador Retriever Mix 7701
Chihuahua Shorthair Mix 6592
...
Dachshund/Wire Hair Fox Terrier 1
Labrador Retriever/Bullmastiff 1
Collie Smooth/Beagle 1
Whippet/Plott Hound 1
Schipperke/Cardigan Welsh Corgi 1
Name: breed, Length: 2790, dtype: int64
There's a huge range of breeds in this dataset. Let's look only at the top 25.
ax = intakes_and_outcomes['breed'].value_counts()[:25].plot(kind = 'bar', color= mcolors.TABLEAU_COLORS, figsize=(10, 7))
ax.set_title("Top Shelter Breeds",fontdict={'fontsize':'30'})
ax.set_facecolor('xkcd:grey')
ax.set_xlabel('Breed', fontsize=18)
ax.set_ylabel('Number of Animals', fontsize=20)
ax.tick_params(axis='x', rotation = 85, labelsize=15)
It looks like breed follows a power-law-like distribution, with a handful of very popular breeds and an extremely long tail of niche mixes. Domestic Shorthair Mix seems to be a catchall for most cats that come into the shelter, followed by easily identified dog breeds: the Pit Bull, Lab, and Chihuahua. Strangely, Bat and "Bat Mix" also claim some of the top spots.
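One way to quantify how top-heavy the distribution is: the share of animals covered by the few biggest breeds. A toy sketch with hypothetical counts (not the real tallies):

```python
import pandas as pd

# A few large categories and a long tail of singletons
counts = pd.Series([100, 50, 25, 1, 1, 1, 1, 1],
                   index=["breed_a", "breed_b", "breed_c", "d", "e", "f", "g", "h"])

# Fraction of all animals accounted for by the top three categories
top3_share = counts.iloc[:3].sum() / counts.sum()
```

Applied to the real value_counts, the same computation would show how few breeds cover most of the shelter's population.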
What about intake type and condition?
ax = intakes_and_outcomes['intake_type'].value_counts().plot(kind = 'bar', color= mcolors.TABLEAU_COLORS, figsize=(5, 5))
ax.set_title("Type of Animal Intakes",fontdict={'fontsize':'30'})
ax.set_facecolor('xkcd:grey')
ax.set_xlabel('Type', fontsize=15)
ax.set_ylabel('Number of Animals', fontsize=20)
ax.tick_params(axis='x', rotation = 45, labelsize=15)
The most common intake types are strays followed by owner surrenders, which isn't too surprising. Luckily, the Abandoned category seems fairly small relative to the others.
ax = intakes_and_outcomes['intake_condition'].value_counts().plot(kind = 'bar', color= mcolors.TABLEAU_COLORS, figsize=(7, 7))
ax.set_title("Condition of Animal at Intake",fontdict={'fontsize':'30'})
ax.set_facecolor('xkcd:grey')
ax.set_xlabel('Condition', fontsize=20)
ax.set_ylabel('Number of Animals', fontsize=20)
ax.tick_params(axis='x', rotation = 75, labelsize=15)
Condition is another heavily skewed feature, dominated by Normal intakes. Most categories are self-explanatory; Panleuk refers to a virus that infects cats, and Agonal refers to dying animals. Thankfully, these are the smallest categories.
fig, axis = plt.pyplot.subplots(1, 2)
fig.suptitle('Animal Sex at:', size = 30)
fig.tight_layout(pad=1.0)
#plt.pyplot.subplots_adjust(left=0.1,bottom=0.1,right=0.9,top=0.9,wspace=0.4,hspace=0.4)
ax = intakes_and_outcomes['sex_upon_intake'].value_counts().plot(kind = 'bar', color= ['xkcd:blue','xkcd:orange','xkcd:kelly green','xkcd:red', 'xkcd:purple']
, figsize=(10, 5), ax=axis[0])
ax.set_title("Intake",fontdict={'fontsize':'30'})
ax.set_facecolor('xkcd:grey')
ax.set_xlabel('Sex', fontsize=20)
ax.set_ylabel('Number of Animals', fontsize=20)
ax.tick_params(axis='x', rotation = 75, labelsize=15)
ax = intakes_and_outcomes['sex_upon_outcome'].value_counts().plot(kind = 'bar', color= ['xkcd:kelly green','xkcd:red','xkcd:blue','xkcd:orange', 'xkcd:purple'], figsize=(10, 5),
ax=axis[1])
ax.set_title("Outtake",fontdict={'fontsize':'30'})
ax.set_facecolor('xkcd:grey')
ax.set_xlabel('Sex', fontsize=20)
ax.set_ylabel('Number of Animals', fontsize=20)
ax.tick_params(axis='x', rotation = 75, labelsize=15)
It appears slightly more male animals come into the shelter than female animals. Furthermore, the relative amounts of intact versus neutered or spayed animals switch positions upon outtake. Let's look a bit more into the number of animals that are fixed by the shelter.
intact_intakes = (intakes_and_outcomes['sex_upon_intake'] == 'Intact Male').sum() + \
                 (intakes_and_outcomes['sex_upon_intake'] == 'Intact Female').sum()
print(f"{intact_intakes} animals have come into the shelter intact")
fixed_animals = (((intakes_and_outcomes['sex_upon_intake'] == 'Intact Male') & (intakes_and_outcomes['sex_upon_outcome'] == 'Neutered Male')) | ((intakes_and_outcomes['sex_upon_intake'] == 'Intact Female') & (intakes_and_outcomes['sex_upon_outcome'] == 'Spayed Female'))).sum()
print(f"{fixed_animals} of these animals were fixed")
percent_fixed = (fixed_animals/intact_intakes) * 100
print(f"{round(percent_fixed,2)} percent of intact animals were fixed by the shelter")
93945 animals have come into the shelter intact
29580 of these animals were fixed
31.49 percent of intact animals were fixed by the shelter
ax = intakes_and_outcomes['outcome_type'].value_counts().plot(kind = 'bar', color= mcolors.TABLEAU_COLORS, figsize=(5, 5))
ax.set_title("Outcome Type",fontdict={'fontsize':'30'})
ax.set_facecolor('xkcd:grey')
ax.set_xlabel('Type of Outcome', fontsize=20)
ax.set_ylabel('Number of Animals', fontsize=20)
ax.tick_params(axis='x', rotation = 75, labelsize=15)
Most animals are adopted out or transferred; rarely, a few animals are lost or stolen. The Austin Animal Center maintains a "no-kill" policy that is discussed further below. Let's look at its rate of euthanasia.
euthanasia = (intakes_and_outcomes['outcome_type'] == 'Euthanasia').sum()
euthanasia_rate = euthanasia/len(intakes_and_outcomes) * 100
print(f"{round(euthanasia_rate,2)} percent of animals that pass through are euthanized")
6.48 percent of animals that pass through are euthanized
There are also some more detailed subtypes for outcomes. Let's quickly graph those as well.
ax = intakes_and_outcomes['outcome_subtype'].value_counts().plot(kind = 'bar', color= mcolors.TABLEAU_COLORS, figsize=(12, 5))
ax.set_title("Outcome Subtype",fontdict={'fontsize':'30'})
ax.set_facecolor('xkcd:grey')
ax.set_xlabel('Subtype', fontsize=20)
ax.set_ylabel('Number of Animals', fontsize=20)
ax.tick_params(axis='x', rotation = 75, labelsize=15)
Subtypes are dominated by transfers to partnering shelters, although there are some animal fosters as well. Overall, subtype will not be particularly useful for this analysis.
Let's also look at animal colors. This info can be an important factor in considering animal adoptions as we will explore below.
intakes_and_outcomes['color'].value_counts()
Black/White 15066
Black 12016
Brown Tabby 8381
Brown 6099
White 5111
...
Cream/Blue Point 1
Torbie/Silver Tabby 1
Lynx Point/Blue 1
Gray Tabby/Orange 1
Blue/Blue Tick 1
Name: color, Length: 621, dtype: int64
This feature also contains an overwhelming number of categories. We will again limit our graph to the 35 most common color categories.
ax = intakes_and_outcomes['color'].value_counts()[:35].plot(kind = 'bar', color= mcolors.TABLEAU_COLORS, figsize=(20, 5))
ax.set_title("Color of Animal",fontdict={'fontsize':'30'})
ax.set_facecolor('xkcd:grey')
ax.set_xlabel('Color', fontsize=20)
ax.set_ylabel('Number of Animals', fontsize=20)
ax.tick_params(axis='x', rotation = 85, labelsize=15)
This feature is fairly loosely categorized, with 621 distinct colors. Black, White, and Brown keep the top spots.
Okay, now that we've looked at most of the features, let's look at some time series information.
ax = intakes_and_outcomes['intake_date'].groupby( intakes_and_outcomes['intake_date'].dt.year).count().plot( kind = 'bar', figsize=(10, 6), position=0, width = 0.4)
intakes_and_outcomes['release_date'].groupby( intakes_and_outcomes['release_date'].dt.year).count().plot( kind = 'bar', position = 1, color = 'xkcd:orange', width = 0.4)
ax.set_title("Yearly Intakes and Outcomes",fontdict={'fontsize':'30'})
ax.set_facecolor('xkcd:grey')
ax.set_ylabel('Number of Animals', fontsize=20)
ax.set_xlabel('Year', fontsize=20)
ax.tick_params(axis='x', rotation = 75, labelsize=15)
ax.legend(labels = ['intakes','outcomes'])
There hasn't been a significant trend across initial years. While it appears that 2013 is much lower, this is due to the fact that the Austin shelter's data collection policy started near the end of that year. The other years are relatively uniform except for the clear influence that the pandemic had as it has since lowered intakes by almost half of their previous values.
ax = intakes_and_outcomes['intake_date'].groupby( intakes_and_outcomes['intake_date'].dt.month).count().plot( kind = 'bar', figsize=(10, 6), position=0, width = 0.4)
intakes_and_outcomes['release_date'].groupby( intakes_and_outcomes['release_date'].dt.month).count().plot( kind = 'bar', position = 1, color = 'xkcd:orange', width = 0.4)
ax.set_title("Monthly Intakes and Outcomes",fontdict={'fontsize':'30'})
ax.set_facecolor('xkcd:grey')
ax.set_ylabel('Number of Animals', fontsize=20)
ax.set_xlabel('Month', fontsize=20)
ax.tick_params(axis='x', rotation = 75, labelsize=15)
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
ax.set_xticklabels(months)
ax.legend(labels = ['intakes','releases'])
There is a seasonal pattern here. Intakes and outcomes occur more in the warmer months and drop off at the end/beginning of the year.
The dataset contains data on where each animal was found. Unfortunately, this information is encoded in plain English as either an address or a general location. As in most data science settings, this information is not usable for analysis in a non-numerical form. Fortunately, there are numerous APIs that support geocoding. These services can return latitude and longitude from a plain-text address or other written location. However, they are severely limited in the number of requests one can make; a common free tier is around 2,500 requests per month. We will be using the U.S. Census Bureau's bulk geocoder, as it allows up to 10,000 addresses per batch. More information on the geocoder is below.
First, we will randomly sample 9000 observations from our data frame to fit under the limit for a single request. We will need to do some cleaning of the provided address to make it suitable for the geocoder. Therefore, we will only sample observations that contain '(TX)' in their address field. This is the standard way the shelter has encoded this location data, but some observations are encoded incorrectly.
#set a seed and sample 9000 observations from our dataset
np.random.seed(0)
address_sample = np.random.choice(intakes_and_outcomes[intakes_and_outcomes['found_location'].str.contains('(TX)', regex=False)]['found_location'].unique(), size = 9000, replace=False )
#the given sample
address_sample
array(['4511 Avenue B in Austin (TX)', '1004 Hermitage Dr in Austin (TX)',
'Hunters Bend And 969 in Austin (TX)', ...,
'Parliment Drive And Garden View Drive in Austin (TX)',
'2800 Vernon Avenue in Austin (TX)',
'4372 N Pace Bend Drive in Travis (TX)'], dtype=object)
We will also grab the indices that each sample had in the original dataset. We will need this info to restitch the data we get back from the geocoder with our original dataset.
indices = [intakes_and_outcomes[intakes_and_outcomes['found_location'] == x].index[0] for x in address_sample]
#turn the arrays into dataframes
address_sample = pd.DataFrame(address_sample, columns=['address'])
indices = pd.DataFrame(indices, columns= ['original_index'])
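The list comprehension above scans the entire dataframe once per sampled address, which can be slow for large samples. As a sketch of a single-pass alternative (using a toy dataframe with illustrative names, not the real shelter data), the first-occurrence index of each location can be precomputed with `drop_duplicates`:

```python
import pandas as pd

# Toy stand-in for intakes_and_outcomes (values are illustrative)
df = pd.DataFrame({'found_location': ['A St (TX)', 'B Ave (TX)', 'A St (TX)', 'C Rd (TX)']})
sample = ['C Rd (TX)', 'A St (TX)']

# Build a location -> first-row-index mapping in one pass,
# instead of scanning the whole frame once per sampled address
first_index = df.reset_index().drop_duplicates('found_location').set_index('found_location')['index']
indices = first_index.loc[sample].tolist()
print(indices)  # first row index of each sampled location
```

This gives the same first-match indices while touching the dataframe only once.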
The geocoder requires that the location data consist of a street address plus either city and state or a zip code. Since we do not have zip codes, we will use city and state info. All of this information is encoded within the original dataset as a single string, so we can extract the relevant portions with a regex.
#extract addresses by custom regex
geo_info = address_sample['address'].str.extract( r'^(\d{0,5}.*)[#\d]* in ([\w|\s]*) \(TX\)$')
geo_info[geo_info[0].isna()]
| 0 | 1 | |
|---|---|---|
| 2013 | NaN | NaN |
address_sample[~ address_sample['address'].str.contains(" in ")]
| address | |
|---|---|
| 2013 | Jonestown (TX) |
Only one observation fails to match the regex, because it does not contain " in " within the address field. We will delete it from our sample.
geo_info.dropna(how= 'any', inplace= True)
#remove from the index data manually
indices.drop(2013, inplace=True)
geo_info.rename({0:"Street", 1: "City"}, axis = 1, inplace=True)
#add the required state field and zipcode field
geo_info['State'] = 'Texas'
geo_info['Zip'] = ' '
geo_info
| Street | City | State | Zip | |
|---|---|---|---|---|
| 0 | 4511 Avenue B | Austin | Texas | |
| 1 | 1004 Hermitage Dr | Austin | Texas | |
| 2 | Hunters Bend And 969 | Austin | Texas | |
| 3 | Burnet And 183 | Austin | Texas | |
| 4 | 814 Sahara Avenue | Austin | Texas | |
| ... | ... | ... | ... | ... |
| 8995 | Gilwell Drive And Diva Circle | Del Valle | Texas | |
| 8996 | 700 E. 8Th St. | Austin | Texas | |
| 8997 | Parliment Drive And Garden View Drive | Austin | Texas | |
| 8998 | 2800 Vernon Avenue | Austin | Texas | |
| 8999 | 4372 N Pace Bend Drive | Travis | Texas |
8999 rows × 4 columns
Now that the data is in the appropriate format, we can save it as a csv and upload it to the geocoder.
#save as a csv
geo_info.to_csv("locations.csv", header=False)
The geocoder can be accessed from a browser, where a csv can be uploaded manually. Its webpage on the Census Bureau's website is located here: https://geocoding.geo.census.gov/geocoder/locations/addressbatch?form. While it matches fairly well, it often does not have an address in its database; in practice it found a match about 60% of the time. Even so, it outweighs many alternatives because of the large number of requests it allows at one time, and because it is completely free!
Now let's get our resulting data.
lat_long = pd.read_csv('GeocodeResults.csv', header = None, names=list('01234567'))
lat_long.head(5)
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | |
|---|---|---|---|---|---|---|---|---|
| 0 | 4970 | Woodmont And Forest Trail, Austin, Texas, | Tie | NaN | NaN | NaN | NaN | NaN |
| 1 | 3640 | 6801 Northeast Drive, Austin, Texas, | No_Match | NaN | NaN | NaN | NaN | NaN |
| 2 | 4971 | 17200 Hamilton Point Cir, Manor, Texas, | Match | Exact | 17200 HAMILTON POINT CIR, MANOR, TX, 78653 | -97.54888061999998,30.33730230200007 | 653233909.0 | R |
| 3 | 2306 | 604 East 45Th 1/2 Street, Austin, Texas, | No_Match | NaN | NaN | NaN | NaN | NaN |
| 4 | 3638 | 1700 Mira Vista Drive, Travis, Texas, | No_Match | NaN | NaN | NaN | NaN | NaN |
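As a rough sketch of how that ~60% figure can be computed from the returned file (assuming, as in the preview above, that the third column holds the match status), the match rate is just the share of rows marked 'Match':

```python
import pandas as pd

# Illustrative status values; the real file uses 'Match', 'No_Match', and 'Tie'
results = pd.Series(['Match', 'No_Match', 'Tie', 'Match', 'No_Match'])
match_rate = (results == 'Match').mean() * 100
print(f"{match_rate:.0f}% of addresses matched")  # prints "40% of addresses matched"
```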
#rename columns
lat_long.rename({'0':'id','1':'address','2':'if_matched','3':'match_type','4':'found_match','5':'lat_long','6':'T_ID','7':'T_IDS'}, axis = 1, inplace=True)
#set index and sort values by index
lat_long = lat_long.set_index('id').sort_index()
#remove observations that did not match
lat_long = lat_long[lat_long['if_matched'] == 'Match']
lat_long.head()
| address | if_matched | match_type | found_match | lat_long | T_ID | T_IDS | |
|---|---|---|---|---|---|---|---|
| id | |||||||
| 0 | 4511 Avenue B, Austin, Texas, | Match | Exact | 4511 AVE B, AUSTIN, TX, 78751 | -97.73092922699999,30.310281451000037 | 63931555.0 | R |
| 1 | 1004 Hermitage Dr, Austin, Texas, | Match | Exact | 1004 HERMITAGE DR, AUSTIN, TX, 78753 | -97.68915203299997,30.348316271000044 | 63932360.0 | L |
| 4 | 814 Sahara Avenue, Austin, Texas, | Match | Exact | 814 SAHARA AVE, AUSTIN, TX, 78745 | -97.78608334199998,30.204970716000048 | 63949351.0 | R |
| 7 | 1521 Reagan Hill Dr, Austin, Texas, | Match | Exact | 1521 REAGAN HILL DR, AUSTIN, TX, 78752 | -97.68972254599998,30.32642259100004 | 64010298.0 | R |
| 10 | 1071 Clayton Street, Austin, Texas, | Match | Non_Exact | 1071 CLAYTON LN, AUSTIN, TX, 78723 | -97.70308496099994,30.31891117200007 | 63932562.0 | R |
#change names of columns, again
lat_long = lat_long['lat_long'].str.split(",",expand=True).rename({1:'latitude',0:'longitude' }, axis = 1)
#reorder columns to the appropriate location
lat_long = lat_long[['latitude','longitude']]
#add original indices back into the data
lat_long = pd.merge(lat_long, pd.DataFrame(indices), left_index=True, right_index=True)
lat_long
| latitude | longitude | original_index | |
|---|---|---|---|
| 0 | 30.310281451000037 | -97.73092922699999 | 119468 |
| 1 | 30.348316271000044 | -97.68915203299997 | 14674 |
| 4 | 30.204970716000048 | -97.78608334199998 | 136359 |
| 7 | 30.32642259100004 | -97.68972254599998 | 68884 |
| 10 | 30.31891117200007 | -97.70308496099994 | 103937 |
| ... | ... | ... | ... |
| 8989 | 30.298299533000034 | -97.50874723399994 | 80202 |
| 8994 | 30.394910753000033 | -97.72696788199994 | 29985 |
| 8996 | 30.268001170000048 | -97.73518045399999 | 49401 |
| 8997 | 30.287375241000063 | -97.65955992199997 | 59022 |
| 8998 | 30.309262963000037 | -97.67862487299999 | 16047 |
5519 rows × 3 columns
Now, we will create a dataframe with our new lat/long data appended to our original dataframe. This lets us see not only where animals are being found but which animals they are.
#merge with original dataframe
intakes_and_outcomes_locations = pd.merge(intakes_and_outcomes, lat_long, how= 'right', left_index = True, right_on= 'original_index')
intakes_and_outcomes_locations = intakes_and_outcomes_locations.set_index('original_index').sort_index()
Using purely the geographic info, we can create a heatmap of the locations where animals were found. We will use the folium library's HeatMap plugin: we create a new dataframe from our geographical data and pass it into the map.
#create a new dataframe for the heatmap
heat_df = intakes_and_outcomes_locations[['latitude','longitude']].reset_index().drop('original_index', axis = 1)
#set the weight of each point on the map
heat_df['weight'] = 1
heat_df
| latitude | longitude | weight | |
|---|---|---|---|
| 0 | 30.267382418000068 | -97.69056104099997 | 1 |
| 1 | 30.464816573000064 | -97.67362911299995 | 1 |
| 2 | 30.41425246700004 | -97.70069719099996 | 1 |
| 3 | 30.47623857600007 | -97.97472081599994 | 1 |
| 4 | 30.23109089700006 | -97.79556176299997 | 1 |
| ... | ... | ... | ... |
| 5514 | 30.20240730200004 | -97.81233149499997 | 1 |
| 5515 | 30.25823648800008 | -97.79052487999996 | 1 |
| 5516 | 30.17401351500007 | -97.80962586699997 | 1 |
| 5517 | 30.286837355000046 | -97.63812391299996 | 1 |
| 5518 | 30.31101224200006 | -97.73325801899995 | 1 |
5519 rows × 3 columns
#import statements
import folium
from folium.plugins import HeatMap
Finally, we can create a heatmap.
#create and display heatmap overlayed on Austin, Texas
hm = folium.Map(location=[30.27679, -97.74907], zoom_start=14)
HeatMap(heat_df, blur = 15, max_opacity=0.1).add_to(folium.FeatureGroup(name='Heat Map').add_to(hm))
folium.LayerControl().add_to(hm)
hm
This heatmap weights each observation the same. However, animal types or breeds that are at significant risk or pose danger to the environment can be weighted more heavily to redirect shelter staff or volunteers to "hotter" areas. Heatmaps using shelter data can also map and help track specific populations of homeless animals and support shelters in increasing the welfare of animals across their city and county.
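As a sketch of such weighting (the intake-type names match the dataset, but the weight values here are purely illustrative assumptions), a weight column can be derived from intake type before the frame is passed to HeatMap:

```python
import pandas as pd

# Hypothetical weights: emphasize intake types that may need faster response
weights = {'Stray': 1.0, 'Wildlife': 2.0, 'Public Assist': 3.0}

toy = pd.DataFrame({
    'latitude': [30.27, 30.28, 30.29],
    'longitude': [-97.74, -97.75, -97.76],
    'intake_type': ['Stray', 'Wildlife', 'Public Assist'],
})
# Map each intake type to its weight; unknown types default to 1.0
toy['weight'] = toy['intake_type'].map(weights).fillna(1.0)
# toy[['latitude', 'longitude', 'weight']] could then be passed to HeatMap
```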
intakes_and_outcomes_locations['intake_type'].value_counts()
Stray                 4641
Wildlife               408
Public Assist          277
Owner Surrender        174
Abandoned               15
Euthanasia Request       4
Name: intake_type, dtype: int64
Let's create another map. This time we will focus on type of intake. These include stray animals, those surrendered by owner, wildlife, public assistance, and abandoned or euthanasia requests. We will define a function that will use these types to determine color for markers then plot them on a map.
def marker_color(x):
    #x is an (index, row) tuple from iterrows
    intake_type = x[1]['intake_type']
    if intake_type == "Stray":
        return 'blue'
    elif intake_type == "Owner Surrender":
        return 'green'
    elif intake_type == "Wildlife":
        return 'red'
    elif intake_type == "Public Assist":
        return 'purple'
    elif intake_type == "Abandoned":
        return 'pink'
    else:
        # orange will correspond to a Euthanasia Request
        return 'orange'
type_map = folium.Map(location=[30.27679, -97.74907], zoom_start=12)
for obs in intakes_and_outcomes_locations[["intake_type", "latitude", "longitude"]].iterrows():
    folium.Circle([obs[1]['latitude'], obs[1]['longitude']], radius=2, color= marker_color(obs), fill = True, fill_color=marker_color(obs), fill_opacity=0.2).add_to(type_map)
type_map
The vast majority of points in our sample consists of stray animals. However, the red strip to the west shows locations where wildlife prefers to roam or, at least, where it comes into human contact. There is also a slight clustering of public assistance cases in downtown Austin. However, this may be a result of geographical data being tagged "Austin, Texas" with no additional address, thus defaulting to the center of the city. A shelter can use this plot to find the locations or communities most in need of public assistance, or populations of wild animals. In either case, such a map can inform a shelter about potential problems and enable it to act on their sources.
From the maps above we can see that geographic information can give a significant edge to any animal shelter, particularly those looking to provide homes to stray animals. Being aware of where stray animals congregate can allow a non-profit to significantly reduce the number of companion animals suffering on the streets. This data can inform programs such as spay/neuter release (SNR) for cats, which do not rehome strays but reduce the number of homeless kittens and severely limit population growth.
This location information can also apply beyond stray animals. For example, is a specific community having issues with keeping animals, or producing a large number of owner surrenders? A low-cost food bank for animals, or other programs, can target the systemic issues that keep owners from being able to care for their companions. This can give better outcomes to both animals and their owners alike, at a fraction of the cost of housing those animals within a shelter.
Ultimately, geographic information should always be considered by a shelter. Many shelters rely significantly on the communities around them to support their operations. Knowing what is happening, and where, within a community is vital. Analyzing where animals are coming from can give significant insight into the reasons animals are finding their way into a shelter. Targeting these causes directly can allow a non-profit to increase its ability to be a force for positive change, and to do so more efficiently than before.
Let's take a look at the distribution of the time that an animal stays at the shelter. This will be our most important metric for information relating to adoptions. Length of stay can tell us which animals are adopted quickly and which may be considered less desirable by adopters.
(intakes_and_outcomes['release_date'] - intakes_and_outcomes['intake_date']).describe()
count                        144689
mean     19 days 10:29:13.594122566
std      46 days 01:05:18.381857483
min                 0 days 00:00:00
25%                 1 days 20:57:00
50%                 5 days 06:29:00
75%                16 days 22:53:00
max              1912 days 22:31:00
dtype: object
Let’s add this to our dataframe as well. Knowing how long an animal stays at the shelter can give us a significant amount of information.
intakes_and_outcomes['time_in_shelter'] = (intakes_and_outcomes['release_date'] - intakes_and_outcomes['intake_date'])
What about the animal that was in the shelter the longest?
longest_years = intakes_and_outcomes['time_in_shelter'].max().days/365
print( f"Longest time an animal has been at the shelter for is: \
{round(longest_years,0)} years! ")
Longest time an animal has been at the shelter for is: 5.0 years!
intakes_and_outcomes[ intakes_and_outcomes['time_in_shelter'] == intakes_and_outcomes['time_in_shelter'].max()]
| animal_id | name | intake_date | found_location | intake_type | intake_condition | animal_type | sex_upon_intake | breed | color | date_of_birth | outcome_type | outcome_subtype | sex_upon_outcome | release_date | time_in_shelter | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 41715 | A642712 | Patches | 2016-01-05 11:37:00 | Crownover St And Caleb Dr in Austin (TX) | Stray | Injured | Dog | Spayed Female | Pit Bull Mix | White/Gray Tiger | 2009-11-28 | Adoption | Foster | Spayed Female | 2021-04-01 10:08:00 | 1912 days 22:31:00 |
Looks like Patches had a hard time but was eventually released to a foster.
According to Texas Monthly, Patches was eventually adopted by her foster parent last year! More info here: https://www.texasmonthly.com/being-texan/austin-dog-patches-adopted-after-waiting-1913-days/
ax = (intakes_and_outcomes['time_in_shelter'].astype('timedelta64[s]') / (60**2 * 24)).plot.hist(bins = 2000, xlim = [0, 100],figsize = (10,5), label = "Histogram of time in shelter", color = 'xkcd:ocean blue' )
ax.set_title('Time of Shelter Stay in Days')
#add red line at the median of 5 days
plt.pyplot.axvline(x = 5, color = 'red', label = 'Median time of stay')
ax.set_facecolor('xkcd:grey')
ax.set_ylabel('Frequency of Animals', fontsize=20)
ax.set_xlabel('Time of Stay in Days', fontsize=20)
ax.legend()
The distribution appears largely exponential with a relatively large rate parameter. However, there also seems to be a cluster around the 5-day mark. Let's look at the distribution for adopted pets specifically to see how this affects the distribution of an animal's shelter stay.
ax = (intakes_and_outcomes[((intakes_and_outcomes['outcome_type'] == 'Adoption')) ]['time_in_shelter'].astype('timedelta64[s]') / (60**2 * 24)).plot.hist(bins = 2000, xlim = [0, 100],figsize = (10,5), label = "Histogram of time in shelter for adopted animals", color = 'xkcd:ocean blue' )
ax.set_title('Time of Shelter Stay in Days for Adopted Animals')
#add red line at the median of 5 days
plt.pyplot.axvline(x = 5, color = 'red', label = 'Median time of stay')
ax.set_facecolor('xkcd:grey')
ax.set_ylabel('Frequency of Animals', fontsize=20)
ax.set_xlabel('Time of Stay in Days', fontsize=20)
ax.legend()
Filtering for adoptions only gives us a distribution that looks more "normally" centered around a mean value. The other outcome types likely represent different processes that resolve quickly, since they are present in our original distribution but not this one. For example, dying or euthanized animals may be at the shelter for hours rather than the days of adopted animals, and animals returned to their owners or transferred to another facility may only be held by the shelter for a short time. Filtering for adopted animals gives more relevant information on an average animal's time of stay, allowing us to focus on improving the traditional shelter adoption process without factors that are not always under a shelter's control, such as sick or dying intakes. Since a shelter wants to maximize adoptions, focusing on adoption time gives us data with practical value for improving the adoption process.
One anecdotal observation among staff in animal shelters is that black dogs are adopted less often than those with lighter coats. This may be due to the perception among adopters that an animal with a darker coat is more aggressive. A study of adoption and euthanasia found, using empirical evidence, that Black Dog Syndrome was a real factor in Sacramento County animal adoptions. The study can be found here: https://web.archive.org/web/20100401052756/http://www.animalsandsociety.org/assets/library/78_jaawsleeper.pdf
The 2002 study uses logistic regression to determine differences between euthanasia and adoption between groups of animals. However, the Austin Animal Shelter maintains a "no-kill" policy. According to a press release from the shelter (https://www.austintexas.gov/news/austin-animal-center-has-no-more-space-asks-community-help), "No-kill shelters strive to only euthanize animals who are irremediably suffering or pose a significant public safety threat" and that "Austin Animal Center is required by city ordinance to meet or exceed a 95% live outcome rate." Yet, we note that the overall kill rate is higher than 5% as demonstrated above in the initial plots sections.
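As a sketch of how that 95% threshold can be checked, the live outcome rate is the share of outcomes that are not deaths (the category names match the dataset; the counts below are toy numbers, not the shelter's actual figures):

```python
# Toy outcome counts (illustrative only)
counts = {'Adoption': 800, 'Transfer': 100, 'Return to Owner': 40,
          'Euthanasia': 50, 'Died': 10}

# Live outcome rate = 1 - (non-live outcomes / total outcomes)
non_live = counts['Euthanasia'] + counts['Died']
live_rate = (1 - non_live / sum(counts.values())) * 100
print(f"Live outcome rate: {live_rate:.1f}%")  # prints "Live outcome rate: 94.0%"
```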
The original study used logistic regression on animals euthanized. However, Austin's "no-kill" policy makes using logistic regression less desirable here as euthanasia is not necessarily the result of an animal that has been at the shelter a long time. As a result, we will focus on length of stay within the shelter to determine whether an animal is desired by potential adopters. This metric was also used by another study of two New York "no-kill" shelters that can be found here: https://www.tandfonline.com/doi/abs/10.1080/10888705.2013.740967.
Our analysis assumes that these animals were up for adoption throughout their stay at the shelter. We also seek to correct a methodological weakness in the original Sacramento study, which assumed that animals were euthanized if they were "not adopted." The study does not tell us whether the shelter euthanized animals after a fixed period of remaining un-adopted, or whether other factors led to euthanasia. For example, an animal that appeared un-adoptable or more aggressive, as judged by shelter staff, may have been euthanized before other animals. As a result, the original findings may reflect biases in the shelter's own selection process rather than the behavior of those adopting from the shelter.
#different outcomes for animals within our dataset
intakes_and_outcomes['outcome_type'].value_counts()
Adoption           67121
Transfer           41640
Return to Owner    23517
Euthanasia          9370
Died                1328
Rto-Adopt            966
Disposal             649
Missing               72
Relocate              23
Stolen                 2
Lost                   1
Name: outcome_type, dtype: int64
Let's first consider only dogs that were adopted. Restricting to adoption outcomes keeps the length of stay tied to the adoption process itself, rather than to other outcomes such as transfer or euthanasia.
#create a new dataframe
BDS = intakes_and_outcomes[(intakes_and_outcomes['outcome_type'] == 'Adoption') & (intakes_and_outcomes['animal_type'] == 'Dog') ]
#check the amount of observations
len(BDS)
39201
Now, we will separate the observations of dogs that have a pure black coat and all others.
#summary for all black dogs time in shelter
BDS[BDS['color'] == 'Black']['time_in_shelter'].describe()
count                          1835
mean     29 days 17:05:14.125340599
std      64 days 04:14:34.789641974
min                 0 days 01:19:00
25%                 4 days 07:56:00
50%                 8 days 08:19:00
75%                27 days 18:35:00
max              1026 days 21:15:00
Name: time_in_shelter, dtype: object
#summary for all other dogs' time in shelter
#(concatenating the black-dog rows twice and dropping all duplicates with
# keep=False leaves only the non-black rows: a set-difference trick)
pd.concat([BDS, BDS[BDS['color'] == 'Black'], BDS[BDS['color'] == 'Black']]).drop_duplicates(keep=False)['time_in_shelter'].describe()
count                         37366
mean     28 days 22:01:05.734089814
std      63 days 03:44:19.005190049
min                 0 days 00:00:00
25%                 4 days 08:58:15
50%                 8 days 02:03:00
75%                26 days 19:38:00
max              1912 days 22:31:00
Name: time_in_shelter, dtype: object
Hmm, the summaries for these two distinct sets of observations appear nearly identical. However, many colors within the non-black category may include black patterns. For example, a Black/White dog could have an almost entirely black coat and be considered a black dog. Indeed, there are labels for both Black/White and White/Black. Assuming that the first listed color is the dominant one in the animal's coat, let's see if we can further refine our data.
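One alternative to the regex used below is to split the color string on '/' and compare the first listed color directly; a minimal sketch on toy values:

```python
import pandas as pd

# Toy color labels in the dataset's "Dominant/Secondary" format
colors = pd.Series(['Black/White', 'White/Black', 'Black', 'Brown Tabby'])

# Take the first '/'-separated token as the (assumed) dominant coat color
primarily_black = colors.str.split('/').str[0] == 'Black'
print(primarily_black.tolist())  # [True, False, True, False]
```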
#colors in this set
# Note: the regex gathers all colors whose first listed color is Black
print(BDS[BDS['color'].str.contains('^Black[\/\w]*$' , regex=True)]['color'].value_counts())
#number of observations
print(len(BDS[BDS['color'].str.contains('^Black[\/\w]*$' , regex=True)]))
Black/White       5036
Black             1835
Black/Tan         1590
Black/Brown       1524
Black/Tricolor     292
Black/Gray          96
Black/Red           40
Black/Black         34
Black/Cream         24
Black/Fawn          10
Black/Buff           8
Black/Silver         6
Black/Orange         2
Black/Yellow         2
Black/Chocolate      2
Black/Blue           1
Name: color, dtype: int64
10502
#all the other colors of adopted dogs
print(pd.concat([BDS, BDS[BDS['color'].str.contains('^Black[\/\w]*$' , regex=True)], BDS[BDS['color'].str.contains('^Black[\/\w]*$' , regex=True)]]).drop_duplicates(keep=False)['color'].value_counts())
#number of observations
print(len(pd.concat([BDS, BDS[BDS['color'].str.contains('^Black[\/\w]*$' , regex=True)], BDS[BDS['color'].str.contains('^Black[\/\w]*$' , regex=True)]]).drop_duplicates(keep=False)['color']))
Brown/White 2257
Tan/White 2137
White 1725
Tan 1710
Tricolor 1601
...
Black Smoke/Black 1
White/Gray Tiger 1
Silver/Cream 1
Sable/Buff 1
Blue/Blue Tick 1
Name: color, Length: 294, dtype: int64
28699
#summary for primarily black dogs
BDS[BDS['color'].str.contains('^Black[\/\w]*$' , regex=True)]['time_in_shelter'].describe()
count                         10502
mean     27 days 18:33:48.705008570
std      60 days 06:19:23.458034094
min                 0 days 00:49:00
25%                 4 days 09:24:45
50%                 8 days 00:48:00
75%                25 days 04:03:00
max              1167 days 19:55:00
Name: time_in_shelter, dtype: object
#summary for non-primarily black dogs
pd.concat([BDS, BDS[BDS['color'].str.contains('^Black[\/\w]*$' , regex=True)], BDS[BDS['color'].str.contains('^Black[\/\w]*$' , regex=True)]]).drop_duplicates(keep=False)['time_in_shelter'].describe()
count                         28699
mean     29 days 09:17:03.115091118
std      64 days 05:47:12.194607745
min                 0 days 00:00:00
25%                 4 days 08:44:00
50%                 8 days 02:41:00
75%                27 days 03:50:30
max              1912 days 22:31:00
Name: time_in_shelter, dtype: object
Even considering dogs that are partially black, we get a similar distribution between both sets. In fact, the lighter coats actually have a larger mean stay in this case. However, this seems to be mostly the result of outliers, as the medians of the two sets are within 2 hours of one another.
#plot the distribution of dogs that have primarily black coats
ax = ( BDS[BDS['color'].str.contains('^Black[\/\w]*$' , regex=True)]['time_in_shelter'].dt.days).plot(kind = "hist", bins = 1000, xlim = [0,100], figsize = (10,5), label = "Histogram of time in shelter for dogs with primarily black coats")
ax.set_title('Time of Shelter Stay in Days for Dogs that have Primarily Black Coats', size = 20)
#add red line at the median of 5 days
plt.pyplot.axvline(x = 5, color = 'red', label = 'Median time of stay')
ax.set_facecolor('xkcd:grey')
ax.set_ylabel('Frequency of Dogs', fontsize=20)
ax.set_xlabel('Time of Stay in Days', fontsize=20)
ax.legend()
#plotting the distribution of non-primarily-black dogs
ax = (pd.concat([BDS, BDS[BDS['color'].str.contains('^Black[\/\w]*$' , regex=True)], BDS[BDS['color'].str.contains('^Black[\/\w]*$' , regex=True)]]).drop_duplicates(keep=False)['time_in_shelter'].dt.days).plot(kind = "hist", bins = 2000, xlim = [0,100], figsize = (10,5), label = "Histogram of time in shelter for dogs without primarily black coats")
ax.set_title('Time of Shelter Stay in Days for Dogs that Do Not Have Primarily Black Coats', size = 20)
#add red line at the median of 5 days
plt.pyplot.axvline(x = 5, color = 'red', label = 'Median time of stay')
ax.set_facecolor('xkcd:grey')
ax.set_ylabel('Frequency of Dogs', fontsize=20)
ax.set_xlabel('Time of Stay in Days', fontsize=20)
ax.legend()
The distributions look very similar in all respects except the number of observations. Let's perform a Mann-Whitney U rank test to see whether stays for primarily black dogs are systematically longer (just to be sure). Since we are only interested in whether dogs with primarily black coats stay longer than their counterparts, we will use a one-sided (greater) test. We use the Mann-Whitney U rank test because, based on the plots above, we cannot confidently assume the distributions are normal.
from scipy.stats import mannwhitneyu
mannwhitneyu(BDS[BDS['color'].str.contains('^Black[\/\w]*$' , regex=True)]['time_in_shelter'].dt.total_seconds()/(60**2 * 24), pd.concat([BDS, BDS[BDS['color'].str.contains('^Black[\/\w]*$' , regex=True)], BDS[BDS['color'].str.contains('^Black[\/\w]*$' , regex=True)]]).drop_duplicates(keep=False)['time_in_shelter'].dt.total_seconds() / (60**2 * 24), alternative = 'greater' )
MannwhitneyuResult(statistic=149511225.5, pvalue=0.8842423151166274)
Due to the very large p-value, we fail to reject the null hypothesis: there is no evidence that dogs with primarily black coats stay in the shelter longer than their counterparts with lighter coats. Therefore, we cannot conclude that those who adopt from the Austin shelter show any prejudice against black dogs or dogs with darker coats.
While we do not see any preference against black dogs within Austin, it is hard to say whether this result generalizes, even within the United States. Some of the underlying theories rest on cultural factors, meaning this data probably supports no inference about the behavior of adopters in other nations.
Similarly, cats have their own "Black Cat Syndrome," as black cats may be associated with witchcraft and supernatural events. Furthermore, some cultures regard them as bad luck. https://www.history.com/news/black-cats-superstitions
Let's perform the same test procedure again, this time for cats within the shelter.
BCS = intakes_and_outcomes[(intakes_and_outcomes['outcome_type'] == 'Adoption') & (intakes_and_outcomes['animal_type'] == 'Cat') ]
#check the amount of observations
len(BCS)
26900
#colors in this set
# Note: the regex gathers all colors whose first listed color is Black
print(BCS[BCS['color'].str.contains('^Black[\/\w]*$' , regex=True)]['color'].value_counts())
#number of observations
print(len(BCS[BCS['color'].str.contains('^Black[\/\w]*$' , regex=True)]))
Black           3357
Black/White     2448
Black/Black       36
Black/Gray         4
Black/Silver       2
Black/Brown        2
Black/Orange       1
Name: color, dtype: int64
5850
#all the other colors of adopted cats
print(pd.concat([BCS, BCS[BCS['color'].str.contains('^Black[\/\w]*$' , regex=True)], BCS[BCS['color'].str.contains('^Black[\/\w]*$' , regex=True)]]).drop_duplicates(keep=False)['color'].value_counts())
#number of observations
print(len(pd.concat([BCS, BCS[BCS['color'].str.contains('^Black[\/\w]*$' , regex=True)], BCS[BCS['color'].str.contains('^Black[\/\w]*$' , regex=True)]]).drop_duplicates(keep=False)['color']))
Brown Tabby                4047
Brown Tabby/White          2194
Orange Tabby               1911
...
Orange/Orange Tabby           1
Tortie Point/Seal Point       1
White/Lilac Point             1
Tan                           1
Name: color, dtype: int64
21050
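As an aside, the concat/drop_duplicates pattern above computes the complement of the black-coat subset. The same set can be obtained more directly with a negated boolean mask; a minimal sketch on a hypothetical toy frame standing in for the adopted-cats table:

```python
import pandas as pd

#toy stand-in for the adopted-cats table
df = pd.DataFrame({'color': ['Black', 'Black/White', 'Brown Tabby', 'Orange Tabby']})

#same pattern as above: colors that start with Black
mask = df['color'].str.contains(r'^Black[/\w]*$', regex=True)

black = df[mask]       #primarily black coats
non_black = df[~mask]  #everything else, no concat/drop_duplicates needed

print(list(non_black['color']))  # → ['Brown Tabby', 'Orange Tabby']
```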
BCS[BCS['color'].str.contains('^Black[\/\w]*$' , regex=True)]['time_in_shelter'].describe()
count                          5850
mean     43 days 09:36:08.379487180
std      53 days 00:19:05.619908981
min                0 days 01:44:00
25%                9 days 22:10:45
50%               30 days 22:41:00
75%               57 days 21:25:45
max              854 days 19:52:00
Name: time_in_shelter, dtype: object
pd.concat([BCS, BCS[BCS['color'].str.contains('^Black[\/\w]*$' , regex=True)], BCS[BCS['color'].str.contains('^Black[\/\w]*$' , regex=True)]]).drop_duplicates(keep=False)['time_in_shelter'].describe()
count                         21050
mean     38 days 09:53:23.247505938
std      55 days 07:19:21.759482908
min                0 days 00:00:00
25%                7 days 03:02:00
50%               24 days 22:28:00
75%               51 days 21:47:15
max             1744 days 22:33:00
Name: time_in_shelter, dtype: object
There is a bit more separation for darker coats here than we saw with dogs. However, black cats make up a smaller share of all cats than black dogs do of all dogs. Also note the significant difference in turnaround time between cats and dogs: a median of around 8 days for dogs versus 25+ for cats. The median stay for black cats is about 6 days longer than for other cats (30 versus 24 days), and the mean about 5 days longer (43 versus 38 days).
#plot the distribution of cats that have primarily black coats
ax = ( BCS[BCS['color'].str.contains('^Black[\/\w]*$' , regex=True)]['time_in_shelter'].dt.days).plot(kind = "hist", bins = 800, xlim = [0,100], figsize = (10,5), label = "Histogram of time in shelter for cats with primarily black coats")
ax.set_title('Time of Shelter Stay in Days for Cats that have Primarily Black Coats', size = 20)
#add red line at the median of 30 days
plt.pyplot.axvline(x = 30, color = 'red', label = 'Median time of stay')
ax.set_facecolor('xkcd:grey')
ax.set_ylabel('Frequency of Animals', fontsize=20)
ax.set_xlabel('Time of Stay in Days', fontsize=20)
ax.legend()
<matplotlib.legend.Legend at 0x2dcf20f8d00>
#plotting the distribution of non-black cats
ax = (pd.concat([BCS, BCS[BCS['color'].str.contains('^Black[\/\w]*$' , regex=True)], BCS[BCS['color'].str.contains('^Black[\/\w]*$' , regex=True)]]).drop_duplicates(keep=False)['time_in_shelter'].dt.days).plot(kind = "hist", bins = 800, xlim = [0,100], figsize = (10,5), label = "Histogram of time in shelter for cats with lighter coats")
ax.set_title('Time of Shelter Stay in Days for Cats that Do Not Have Primarily Black Coats', size = 20)
#add red line at the median of 24 days
plt.pyplot.axvline(x = 24, color = 'red', label = 'Median time of stay')
ax.set_facecolor('xkcd:grey')
ax.set_ylabel('Frequency of Animals', fontsize=20)
ax.set_xlabel('Time of Stay in Days', fontsize=20)
ax.legend()
<matplotlib.legend.Legend at 0x2dcf44c4670>
The distributions once again appear similar in shape, and once again we cannot assume normality for either set. We will perform another Mann-Whitney U rank test to determine whether the difference we noted above is statistically significant.
#time in shelter, in days, for cats with primarily black coats versus all other adopted cats
black_cat_days = BCS[BCS['color'].str.contains('^Black[\/\w]*$' , regex=True)]['time_in_shelter'].dt.total_seconds() / (60**2 * 24)
other_cat_days = pd.concat([BCS, BCS[BCS['color'].str.contains('^Black[\/\w]*$' , regex=True)], BCS[BCS['color'].str.contains('^Black[\/\w]*$' , regex=True)]]).drop_duplicates(keep=False)['time_in_shelter'].dt.total_seconds() / (60**2 * 24)
mannwhitneyu(black_cat_days, other_cat_days, alternative = 'greater')
MannwhitneyuResult(statistic=67138649.5, pvalue=1.5498028547702165e-26)
With a p-value that is essentially zero, we conclude that cats with darker coats tend to stay in the shelter longer than those without. While we cannot conclude that this is directly the result of adopters purposefully avoiding cats with dark coats, the strength of the result tells us it is almost certainly not due to random noise. Thus, there does seem to be a sort of Black Cat Syndrome that influences length of stay within the shelter. Although we cannot prove it definitively, adopters avoiding black cats is a reasonable explanation for the difference we see in this data.
Using this information, a shelter can factor this into its adoption process. This could mean making black cats more visible to adopters visiting the shelter, producing material that dispels the myths or biases adopters hold against black cats, or promoting black cats in animal spotlights and other communications. Knowing is half the battle here, and having empirical evidence of a potential bias enables a shelter to act to confront it. In general, hypothesis testing allows a shelter or other non-profit to establish a clear ground truth, paving the way for policy that targets issues at their roots.
For a shelter, predicting which animals are most coveted by adopters and those that are less desirable can allow it to refocus its energy, time, and expertise to adopt out animals that may otherwise be passed over or not considered. Such a strategy can save the vital and often insufficient resources of a shelter by reducing the number of animals with long and costly shelter stays.
Some of the potential factors in adoption are easy to predict. The age of the animal, its species, and its temperament are factors that many adopters have in mind before they even visit a shelter. Some of these are fairly obvious and considered by shelters on a non-empirical basis. Age is a clear example, as most shelters are aware in some capacity that older pets are often passed over by adopters in favor of younger animals.
We will use a regression decision tree to demonstrate how even simple machine learning models can provide significant information when it comes to determining potential animal outcomes.
This will require several adjustments to our dataframe, including dropping redundant information and re-encoding the remaining columns so they can be understood by our regression tree.
#import tree and spliting function from sklearn library
from sklearn import tree
from sklearn.model_selection import train_test_split
#create a tree with a maximum depth of 25
clf = tree.DecisionTreeRegressor(max_depth= 25)
#copy our data into a new dataframe for the tree
tree_data = intakes_and_outcomes.copy()
#remove id and name since they give us little to no info
tree_data.drop(['animal_id','name'], axis = 1 , inplace= True)
Again, we will only consider the traditional adoption process to make things simple.
#adoptions only
tree_data = tree_data[tree_data['outcome_type'] == 'Adoption']
#remove the now useless column and the subtype as well
tree_data.drop(['outcome_type', 'outcome_subtype'], axis = 1 , inplace= True)
#we'll also drop address
tree_data.drop(['found_location'], axis = 1 , inplace= True)
We need to convert our datetime values to floats to be read by the model. Let's start with the dates by defining a conversion function.
#datetime to float
def datetime_to_float(date):
return date.timestamp()
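A quick check of what timestamp() returns (pandas Timestamps expose the same method). Note that a timezone-aware datetime gives a deterministic result, while a naive one is interpreted in the local timezone:

```python
from datetime import datetime, timezone

#one day after the Unix epoch is exactly 86400 seconds
d = datetime(1970, 1, 2, tzinfo=timezone.utc)
print(d.timestamp())  # → 86400.0
```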
#preview the timedelta column converted to seconds
tree_data['time_in_shelter'].dt.total_seconds()
8 1996200.0
14 7876200.0
16 1570320.0
19 973740.0
20 1125240.0
...
146021 517740.0
146023 260820.0
146028 443700.0
146029 441420.0
146036 431820.0
Name: time_in_shelter, Length: 67121, dtype: float64
#apply on all our datetime features
tree_data['intake_date'] = tree_data['intake_date'].apply(datetime_to_float)
tree_data['date_of_birth'] = tree_data['date_of_birth'].apply(datetime_to_float)
tree_data['release_date'] = tree_data['release_date'].apply(datetime_to_float)
#timedelta is a little different
tree_data['time_in_shelter'] = tree_data['time_in_shelter'].dt.total_seconds()
Now let’s convert sex_upon_intake and sex_upon_outcome into was_fixed and is_male columns to make them binary properties for our tree.
#def function to determine if the animal was fixed or not
def was_fixed(intake, outake):
if (intake == "Intact Female" and outake == "Spayed Female") or (intake == "Intact Male" and outake == "Neutered Male"):
return 1
return 0
tree_data['was_fixed'] = tree_data.apply(lambda x: was_fixed(x.sex_upon_intake, x.sex_upon_outcome), axis=1)
tree_data['sex_upon_intake'].value_counts()
Intact Male      25275
Intact Female    24943
Neutered Male     8465
Spayed Female     8228
Unknown            210
Name: sex_upon_intake, dtype: int64
tree_data.head()
| intake_date | intake_type | intake_condition | animal_type | sex_upon_intake | breed | color | date_of_birth | sex_upon_outcome | release_date | time_in_shelter | was_fixed | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8 | 1.380625e+09 | Stray | Normal | Dog | Intact Female | Podengo Pequeno Mix | Black | 1.254355e+09 | Spayed Female | 1.382621e+09 | 1996200.0 | 1 |
| 14 | 1.380627e+09 | Owner Surrender | Injured | Dog | Spayed Female | Labrador Retriever/Chinese Sharpei | Tan | 1.176422e+09 | Spayed Female | 1.388503e+09 | 7876200.0 | 0 |
| 16 | 1.380627e+09 | Stray | Normal | Dog | Neutered Male | Great Pyrenees Mix | White | 1.317427e+09 | Neutered Male | 1.382197e+09 | 1570320.0 | 0 |
| 19 | 1.380628e+09 | Stray | Normal | Cat | Intact Male | Domestic Shorthair Mix | Brown Tabby | 1.374019e+09 | Neutered Male | 1.381602e+09 | 973740.0 | 1 |
| 20 | 1.380628e+09 | Stray | Normal | Cat | Intact Male | Domestic Shorthair Mix | Brown Tabby | 1.374019e+09 | Neutered Male | 1.381754e+09 | 1125240.0 | 1 |
Let's also remove the few unknown sexes in the data.
#drop rows where the sex of the animal is unknown
tree_data.drop(tree_data[tree_data['sex_upon_intake'] == 'Unknown'].index, inplace= True)
#def function to determine if the animal is male or not
def is_male(intake):
if "Male" in intake:
return 1
return 0
#apply and create new column in dataframe
tree_data['is_male'] = tree_data['sex_upon_intake'].apply(is_male)
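A few sanity checks on the two helpers (redefined here so the cell stands alone). Note that the substring check in is_male is case-sensitive, which is why "Intact Female" is not matched by "Male":

```python
#redefining the helpers from above so this cell stands alone
def was_fixed(intake, outake):
    if (intake == "Intact Female" and outake == "Spayed Female") or (intake == "Intact Male" and outake == "Neutered Male"):
        return 1
    return 0

def is_male(intake):
    if "Male" in intake:
        return 1
    return 0

print(was_fixed("Intact Male", "Neutered Male"))    # → 1: fixed by the shelter
print(was_fixed("Spayed Female", "Spayed Female"))  # → 0: arrived already fixed
print(is_male("Neutered Male"))                     # → 1
print(is_male("Intact Female"))                     # → 0 ("Male" is not in "Female")
```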
tree_data.head()
| intake_date | intake_type | intake_condition | animal_type | sex_upon_intake | breed | color | date_of_birth | sex_upon_outcome | release_date | time_in_shelter | was_fixed | is_male | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8 | 1.380625e+09 | Stray | Normal | Dog | Intact Female | Podengo Pequeno Mix | Black | 1.254355e+09 | Spayed Female | 1.382621e+09 | 1996200.0 | 1 | 0 |
| 14 | 1.380627e+09 | Owner Surrender | Injured | Dog | Spayed Female | Labrador Retriever/Chinese Sharpei | Tan | 1.176422e+09 | Spayed Female | 1.388503e+09 | 7876200.0 | 0 | 0 |
| 16 | 1.380627e+09 | Stray | Normal | Dog | Neutered Male | Great Pyrenees Mix | White | 1.317427e+09 | Neutered Male | 1.382197e+09 | 1570320.0 | 0 | 1 |
| 19 | 1.380628e+09 | Stray | Normal | Cat | Intact Male | Domestic Shorthair Mix | Brown Tabby | 1.374019e+09 | Neutered Male | 1.381602e+09 | 973740.0 | 1 | 1 |
| 20 | 1.380628e+09 | Stray | Normal | Cat | Intact Male | Domestic Shorthair Mix | Brown Tabby | 1.374019e+09 | Neutered Male | 1.381754e+09 | 1125240.0 | 1 | 1 |
Now, let's remove the sex columns.
tree_data.drop(['sex_upon_intake','sex_upon_outcome'], axis = 1 , inplace= True)
The intake condition column also skews our data significantly. Animals that are injured or otherwise in a non-normal condition will potentially go through a different process prior to being adopted. This may include receiving medical care, which subverts the traditional adoption pipeline as explored above. Let's remove this column as well.
tree_data.drop(['intake_condition'], axis = 1 , inplace= True)
The release_date column also needs to be removed, as the classifier could use it to "cheat" and derive the time of stay, which is exactly what we want to predict. Let's remove it as well.
tree_data.drop(['release_date'], axis = 1 , inplace= True)
tree_data.head()
| intake_date | intake_type | animal_type | breed | color | date_of_birth | time_in_shelter | was_fixed | is_male | |
|---|---|---|---|---|---|---|---|---|---|
| 8 | 1.380625e+09 | Stray | Dog | Podengo Pequeno Mix | Black | 1.254355e+09 | 1996200.0 | 1 | 0 |
| 14 | 1.380627e+09 | Owner Surrender | Dog | Labrador Retriever/Chinese Sharpei | Tan | 1.176422e+09 | 7876200.0 | 0 | 0 |
| 16 | 1.380627e+09 | Stray | Dog | Great Pyrenees Mix | White | 1.317427e+09 | 1570320.0 | 0 | 1 |
| 19 | 1.380628e+09 | Stray | Cat | Domestic Shorthair Mix | Brown Tabby | 1.374019e+09 | 973740.0 | 1 | 1 |
| 20 | 1.380628e+09 | Stray | Cat | Domestic Shorthair Mix | Brown Tabby | 1.374019e+09 | 1125240.0 | 1 | 1 |
Our remaining variables are categorical. Breed and color in particular have a massive number of possible values, so we will not be able to one-hot encode these features in the traditional way without suffering significant issues due to their high cardinality.
Instead, we will use leave-one-out encoding, as described here: https://innovation.alteryx.com/encode-smarter/. More information on Owen Zhang's leave-one-out encoding strategy is also available here: https://datascience.stackexchange.com/questions/10839/what-is-difference-between-one-hot-encoding-and-leave-one-out-encoding. The main attribute of this encoding strategy is its ability to handle categories with very large cardinality, which is exactly what we have in our data.
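To make the mechanics concrete, here is a minimal hand-rolled sketch of leave-one-out target encoding (not the category_encoders implementation, which adds extras such as noise and unseen-category handling): each row's category value is replaced by the mean of the target over all *other* rows in the same category.

```python
import pandas as pd

def leave_one_out_encode(df, col, target):
    #per-category sums and counts, broadcast back to each row
    grp = df.groupby(col)[target]
    sums = grp.transform('sum')
    counts = grp.transform('count')
    #exclude the row's own target value from its category mean
    loo = (sums - df[target]) / (counts - 1)
    #singleton categories have no "other" rows; fall back to the global mean
    return loo.fillna(df[target].mean())

#toy frame: two rows of breed 'a', one singleton breed 'b'
toy = pd.DataFrame({'breed': ['a', 'a', 'b'], 'stay': [10.0, 20.0, 30.0]})
print(leave_one_out_encode(toy, 'breed', 'stay').tolist())  # → [20.0, 10.0, 20.0]
```

Each 'a' row is encoded with the other 'a' row's target, while the singleton 'b' row falls back to the global mean of 20.0.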
#import encoder library
import category_encoders as ce
#create encoder
encoder = ce.LeaveOneOutEncoder(cols=['breed','color','animal_type','intake_type', 'was_fixed', 'is_male'])
encoder
LeaveOneOutEncoder(cols=['breed', 'color', 'animal_type', 'intake_type',
                         'was_fixed', 'is_male'])
#split our data into training and test sets
train, test = train_test_split(tree_data, test_size=0.2)
#display the training data
train.head()
| intake_date | intake_type | animal_type | breed | color | date_of_birth | time_in_shelter | was_fixed | is_male | |
|---|---|---|---|---|---|---|---|---|---|
| 9657 | 1.398882e+09 | Owner Surrender | Cat | Domestic Shorthair Mix | Black/White | 1.367280e+09 | 3276240.0 | 0 | 0 |
| 119521 | 1.597403e+09 | Owner Surrender | Dog | Dalmatian Mix | White/Black | 1.340150e+09 | 5379060.0 | 0 | 1 |
| 27423 | 1.429467e+09 | Owner Surrender | Dog | Jack Russell Terrier/Chihuahua Shorthair | Brown | 1.407974e+09 | 589140.0 | 0 | 1 |
| 117492 | 1.590831e+09 | Stray | Dog | Pit Bull | Blue/White | 1.527638e+09 | 1909680.0 | 1 | 0 |
| 132421 | 1.634308e+09 | Owner Surrender | Cat | Domestic Shorthair | Brown Tabby | 1.628986e+09 | 354600.0 | 1 | 1 |
#fit our train data and save it to our dataframe
train = encoder.fit_transform(train, train['time_in_shelter'])
#display the encoded training data
train.head()
| intake_date | intake_type | animal_type | breed | color | date_of_birth | time_in_shelter | was_fixed | is_male | |
|---|---|---|---|---|---|---|---|---|---|
| 9657 | 1.398882e+09 | 2.805740e+06 | 3.410064e+06 | 3.453734e+06 | 3.027211e+06 | 1.367280e+09 | 3276240.0 | 3.131054e+06 | 2.875826e+06 |
| 119521 | 1.597403e+09 | 2.805604e+06 | 2.492579e+06 | 1.825184e+06 | 2.897332e+06 | 1.340150e+09 | 5379060.0 | 3.130919e+06 | 2.839629e+06 |
| 27423 | 1.429467e+09 | 2.805914e+06 | 2.492732e+06 | 1.470478e+06 | 2.278115e+06 | 1.407974e+09 | 589140.0 | 3.131228e+06 | 2.839807e+06 |
| 117492 | 1.590831e+09 | 2.836647e+06 | 2.492689e+06 | 4.647528e+06 | 3.787536e+06 | 1.527638e+09 | 1909680.0 | 2.746285e+06 | 2.875877e+06 |
| 132421 | 1.634308e+09 | 2.805929e+06 | 3.410199e+06 | 3.593382e+06 | 3.603237e+06 | 1.628986e+09 | 354600.0 | 2.746326e+06 | 2.839815e+06 |
#display the test data
test.head()
| intake_date | intake_type | animal_type | breed | color | date_of_birth | time_in_shelter | was_fixed | is_male | |
|---|---|---|---|---|---|---|---|---|---|
| 35652 | 1.441034e+09 | Stray | Cat | Domestic Shorthair Mix | Brown Tabby | 1.435709e+09 | 613740.0 | 1 | 0 |
| 49240 | 1.465141e+09 | Stray | Cat | Snowshoe Mix | Lynx Point | 1.461802e+09 | 2954100.0 | 1 | 1 |
| 17429 | 1.410271e+09 | Owner Surrender | Cat | Domestic Shorthair Mix | Brown Tabby | 1.252454e+09 | 178920.0 | 0 | 1 |
| 12892 | 1.403438e+09 | Owner Surrender | Cat | Domestic Shorthair Mix | Black | 1.398211e+09 | 1745460.0 | 1 | 0 |
| 96521 | 1.552060e+09 | Stray | Dog | Chihuahua Longhair Mix | White/Yellow | 1.546906e+09 | 349440.0 | 1 | 0 |
#encode the test data using the category statistics learned from the training set
#note: we do not pass the test target to transform, as that would leak it into the features
test = encoder.transform(test)
#display the encoded test data
test.head()
| intake_date | intake_type | animal_type | breed | color | date_of_birth | time_in_shelter | was_fixed | is_male | |
|---|---|---|---|---|---|---|---|---|---|
| 35652 | 1.441034e+09 | 2.836683e+06 | 3.410187e+06 | 3.453956e+06 | 3.603157e+06 | 1.435709e+09 | 613740.0 | 2.746319e+06 | 2.875926e+06 |
| 49240 | 1.465141e+09 | 2.836619e+06 | 3.410079e+06 | 3.757694e+06 | 2.884630e+06 | 1.461802e+09 | 2954100.0 | 2.746258e+06 | 2.839719e+06 |
| 17429 | 1.410271e+09 | 2.805941e+06 | 3.410207e+06 | 3.453992e+06 | 3.603291e+06 | 1.252454e+09 | 178920.0 | 3.131254e+06 | 2.839822e+06 |
| 12892 | 1.403438e+09 | 2.805839e+06 | 3.410135e+06 | 3.453862e+06 | 3.345296e+06 | 1.398211e+09 | 1745460.0 | 2.746290e+06 | 2.875883e+06 |
| 96521 | 1.552060e+09 | 2.836690e+06 | 2.492739e+06 | 1.134077e+06 | 4.150675e+06 | 1.546906e+09 | 349440.0 | 2.746326e+06 | 2.875936e+06 |
#fit the decision tree on the encoded training data
clf.fit(train.drop(['time_in_shelter'], axis = 1), train['time_in_shelter'])
DecisionTreeRegressor(max_depth=25)
Now that we have a tree, we can visualize the features it uses to make decisions. Unfortunately, the plot is not very helpful or intuitive due to our use of a regression tree and encoded features, but it does give some insight into which features the tree splits on first.
#show a plot of the decision tree; since this is a regression tree, the output is less helpful
tree.plot_tree(clf, max_depth= 5, fontsize = 5)
fig = plt.pyplot.gcf()
fig.set_size_inches(10, 5)
# fig.savefig('tree.png')
Let's see what attributes are most important when considering how long an animal will stay at the shelter (according to the tree).
clf.feature_importances_
array([5.42172295e-05, 1.15349129e-01, 1.45955929e-01, 1.03613548e-04,
1.33839844e-04, 3.36052496e-04, 3.46640080e-01, 3.91427138e-01])
print("The most important features, according to the tree are:")
#list features in order using feature importances
i = 1
for feature in [x for _, x in sorted(zip(clf.feature_importances_, train.drop(['time_in_shelter'], axis = 1).columns ), key=lambda pair: pair[0], reverse = True)]:
print(str(i) + ". " + feature)
i = i +1
The most important features, according to the tree are:
1. is_male
2. was_fixed
3. animal_type
4. intake_type
5. date_of_birth
6. color
7. breed
8. intake_date
The decision tree determines that sex and whether the animal was fixed are fairly important features for its initial splits. Does this tell us that the sex of an animal is most important for determining the length of its stay?
print("The length of stay for male animals in days is: " + str(tree_data[tree_data.is_male == 1]['time_in_shelter'].mean()/(60*60*24)))
print("The length of stay for female animals in days is: " + str(tree_data[tree_data.is_male == 0]['time_in_shelter'].mean()/(60*60*24)))
#note: the test must be given the two full samples, not the two scalar means
mannwhitneyu(tree_data[tree_data.is_male == 1]['time_in_shelter'], tree_data[tree_data.is_male == 0]['time_in_shelter'])
The length of stay for male animals in days is: 33.04557817542646
The length of stay for female animals in days is: 33.23985138876328
print("The length of stay for animals fixed by the shelter in days is: " + str(tree_data[tree_data.was_fixed == 1]['time_in_shelter'].mean()/(60*60*24)))
print("The length of stay for animals not fixed by the shelter in days is: " + str(tree_data[tree_data.was_fixed == 0]['time_in_shelter'].mean()/(60*60*24)))
mannwhitneyu(tree_data[tree_data.was_fixed == 1]['time_in_shelter'], tree_data[tree_data.was_fixed == 0]['time_in_shelter'])
The length of stay for animals fixed by the shelter in days is: 31.63871448469921
The length of stay for animals not fixed by the shelter in days is: 36.8210564658331
Not really. The mean stays for male and female animals are nearly identical (about 33.0 versus 33.2 days), and the gap between animals fixed by the shelter and those that were not is modest (about 31.6 versus 36.8 days). It is more likely that the tree treats these features as important because of their binary nature, as opposed to the other features, which have high cardinality.
#score for training data
clf.score(train.drop('time_in_shelter', axis = 1), train['time_in_shelter'])
0.9999999941776533
#score for test data
clf.score(test.drop('time_in_shelter', axis = 1), test['time_in_shelter'])
0.9667989343943256
Looks like a fairly high coefficient of determination (R^2) for both sets, with the test data scoring slightly lower, as expected. Scores this close to 1 are also a prompt to double-check that no target information leaks into the features through the encoding.
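For reference, the score reported by DecisionTreeRegressor.score is the coefficient of determination, which can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares; a minimal sketch:

```python
def r_squared(y_true, y_pred):
    #R^2 = 1 - SS_res / SS_tot
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

#perfect predictions give R^2 = 1
print(r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # → 1.0
#predicting the mean for every point gives R^2 = 0
print(r_squared([1.0, 2.0, 3.0], [2.0, 2.0, 2.0]))  # → 0.0
```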
Let's predict a few values from the test set as well to give us an idea of how an animal shelter such as the Austin Animal Center can use this data to predict the length of stay of an incoming animal.
obs = test.sample(5)
obs
| intake_date | intake_type | animal_type | breed | color | date_of_birth | time_in_shelter | was_fixed | is_male | |
|---|---|---|---|---|---|---|---|---|---|
| 97707 | 1.554224e+09 | 2.836688e+06 | 3.410196e+06 | 3.453971e+06 | 3.345610e+06 | 1.522627e+09 | 432960.0 | 3.131238e+06 | 2.839813e+06 |
| 44646 | 1.458220e+09 | 2.836689e+06 | 2.492739e+06 | 2.281792e+06 | 2.736874e+06 | 1.452989e+09 | 367440.0 | 2.746326e+06 | 2.839815e+06 |
| 83961 | 1.528632e+09 | 2.836673e+06 | 2.492720e+06 | 1.834056e+06 | 1.852058e+06 | 1.370822e+09 | 961860.0 | 3.131204e+06 | 2.839793e+06 |
| 100168 | 1.557834e+09 | 2.836543e+06 | 3.409950e+06 | 3.453531e+06 | 3.909467e+06 | 1.557446e+09 | 5714340.0 | 2.746185e+06 | 2.839617e+06 |
| 17186 | 1.409845e+09 | 2.836690e+06 | 2.492739e+06 | 1.312161e+06 | 1.812277e+06 | 1.346717e+09 | 357600.0 | 3.131243e+06 | 2.839815e+06 |
clf.predict(obs.drop('time_in_shelter', axis = 1))
for animal in (clf.predict(obs.drop('time_in_shelter', axis = 1))/(60*60)):
print("The animal is predicted to spend " + str(round(animal / 24,0)) + " days and " + str(round(animal % 24,0)) + " hours in the shelter!" )
The animal is predicted to spend 5.0 days and 0.0 hours in the shelter!
The animal is predicted to spend 4.0 days and 6.0 hours in the shelter!
The animal is predicted to spend 11.0 days and 3.0 hours in the shelter!
The animal is predicted to spend 66.0 days and 4.0 hours in the shelter!
The animal is predicted to spend 4.0 days and 4.0 hours in the shelter!
for animal in (obs['time_in_shelter']/(60*60)):
print("The animal has actually spent " + str(round(animal / 24,0)) + " days and " + str(round(animal % 24,0)) + " hours in the shelter!" )
The animal has actually spent 5.0 days and 0.0 hours in the shelter!
The animal has actually spent 4.0 days and 6.0 hours in the shelter!
The animal has actually spent 11.0 days and 3.0 hours in the shelter!
The animal has actually spent 66.0 days and 3.0 hours in the shelter!
The animal has actually spent 4.0 days and 3.0 hours in the shelter!
The predictions are fairly close!
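The day/hour arithmetic above can also be written with divmod, which avoids the float rounding in the prints (a small alternative sketch; seconds_to_days_hours is our own helper name, not part of any library):

```python
def seconds_to_days_hours(seconds):
    #split a duration in seconds into whole days and remaining whole hours
    days, remainder = divmod(int(seconds), 60 * 60 * 24)
    hours = remainder // (60 * 60)
    return days, hours

#e.g. a stay of 5714340 seconds
print(seconds_to_days_hours(5714340))  # → (66, 3)
```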
This regression tree is a fairly simple model in the machine learning space. However, even simple techniques can yield substantial insight into the animals that shelters work to rehabilitate and rehome. Many of these facilities see thousands of animals pass through their doors annually, giving machine learning models a wealth of information to use in predicting future animal outcomes. Shelter data also allows these data-driven models, used extensively in business and money-making applications, to have a purely positive impact by helping to save animal lives.
Hopefully this tutorial has made clear the many dimensions of insight that data analysis can offer animal shelters: establishing facts through hypothesis testing, mapping geographical data using geocoding services, and using simple machine learning algorithms to estimate what an animal's stay in the shelter may look like. In this analysis, we were able to determine, from relatively simple data, that "Black Cat Syndrome" was a factor in adoptions from the shelter while "Black Dog Syndrome" was not. We also mapped the geographical distribution of animals brought into the shelter using a heatmap and plots of the different intake types. Finally, we found that a simple decision tree can be an effective way to predict animal outcomes.
Even this simple information can make a world of difference for an animal shelter. A reasonable estimate of how long an animal will stay allows such a non-profit to allocate and find the necessary resources, or save them for when they are really needed. Work in the animal sheltering industry is often time-consuming, costly, and sometimes disheartening. Using information that is already present, animal shelters can make such work more predictable, resource-efficient, and easier on shelter staff, while potentially saving more animal lives in the process. Informational feedback on how a shelter is performing can also enable it to further refine its operations, targeting specific weaknesses or community issues as they appear.
Tech has been tried in almost every aspect of improving human lives, and data is now collected in nearly every industry, from estimating driving times to treating illnesses. However, using tech and data to improve the lives of our furry companions is not yet as widespread. Through techniques like these, we hope that many will leverage the increasingly familiar power of data science and use it to make real improvements in the lives of animals. With adoption of such methods, we can open the door to a new era that makes finding homes for companion animals an efficient and worry-free prospect.